The National Academies: Advisers to the Nation on Science, Engineering, and Medicine
NATIONAL ACADEMY OF SCIENCES NATIONAL ACADEMY OF ENGINEERING INSTITUTE OF MEDICINE NATIONAL RESEARCH COUNCIL

Workshop on Understanding and Promoting Knowledge Accumulation in Education:

Tools and Strategies for Education Research

Day 1 – June 30, 2003

Remarks by Dr. David Grissmer

DR. DAVID GRISSMER: I am going to use my 15 minutes today to talk in two contexts. One is, I think, my assigned topic, which was not as specific as I got today, which is my use of NAEP data.

Second, I think I would be remiss not to report some evidence that is being gathered from a kind of parallel study, which I am trying to guide, which deals with developmental research in children from the time they are conceived until the time they enter the labor force. It is looking at social, cognitive and emotional development, and the federal government spends about $2.5 million on that kind of research.

Then, the value of offering some observations on that today is that education research constitutes a small, but significant portion of that total research, and sometimes it is comforting to know that you are not alone in the troubles that you are having, and sometimes it is also useful to know that there is help that comes from outside of educational research. So I will offer a few observations on that.

With respect to the NAEP data, I have been using NAEP data for about eight years now. My hair was brown before I started using NAEP data. (Laughter).

NAEP data is absolutely unique in several ways, as Mike Nettles indicated earlier. It is the only data that can track students from 1970 that has any reliability. SAT data simply does not have the statistical characteristics that allow it to reliably track random samples of the population.

It’s 9-, 13- and 17-year-olds you get. You get reading, math and science, plus writing, plus - as time goes on - science, art. It is absolutely a unique data resource, and despite the travails of the Department of Education over time, from being close to going out of business to funding cuts, I find it absolutely remarkable that the data is of such good quality. I attribute part of that to ETS, certainly in the post-‘84 era, but even before that, the data has what I consider to be unusually good characteristics.

NAEP definitely has a limited number of research questions it can address. There are lots of other databases that are longitudinal, that have larger samples and that have other characteristics that make them preferable for answering a good many research questions, but on certain questions, NAEP is the only game in town.

An observation: in the Jencks and Phillips book on the black/white score gap, I think there were like 10 or 11 chapters, and I think five or six of them use NAEP data to make their point. We would know virtually nothing about the reduction in the black/white score gap or a lot of its characteristics were it not for NAEP data.

However, since NAEP is not longitudinal, the family characteristics collected with NAEP are, relatively speaking, poor. You have to do a lot of supplementation of them, and because it is a difficult database to work with, I think it is basically underutilized.

What makes NAEP difficult, I think, was mentioned by Mike as well. No student gets a complete test, so what you get is an estimated score for a student, the so-called infamous plausible value, and you get five scores for each student, not just one, and this has sort of serious implications for how you calculate standard errors, the reliability of the data and all of that kind of stuff. It makes it a complex database to deal with.
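
[Editorial illustration, not part of the spoken remarks.] The point about five scores per student and standard errors can be made concrete with a minimal sketch. It assumes the standard multiple-imputation combining rule (Rubin's rules), which is the usual way plausible values are analyzed: run the analysis once per plausible value, average the estimates, and add the between-value variance to the average sampling variance. The function name and all numbers are hypothetical, and the sampling variances are simply taken as given here; how they are estimated for a cluster sample is not shown.

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine one estimate per plausible value into a point estimate and SE.

    estimates          -- the statistic computed separately on each plausible value
    sampling_variances -- the sampling variance of each of those estimates
    """
    m = len(estimates)
    point = mean(estimates)             # average over the plausible values
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # variance across plausible values (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return point, total ** 0.5


# Hypothetical group means computed on each of five plausible values.
pv_means = [281.0, 283.0, 282.0, 280.0, 284.0]
pv_sampling_vars = [1.44, 1.44, 1.44, 1.44, 1.44]

point, se = combine_plausible_values(pv_means, pv_sampling_vars)
print(f"estimate = {point:.1f}, standard error = {se:.2f}")
```

In this made-up example, treating a single plausible value as if it were an observed score would give a standard error of 1.2; the combined standard error is larger because it also reflects the uncertainty in each student's estimated score.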

The other factor is that it’s got a cluster sample, which means you have to deal with design effects. Sample sizes are not huge. So in terms of multi-variable analysis, sometimes you just run out of steam with NAEP, but in terms of the central questions that No Child Left Behind is trying to address, it remains a fairly unique asset, and I think an asset that is probably not up to all of the expectations which have been established from No Child Left Behind as far as measurements and things of that kind go. So I think it is basically sort of underutilized; that is, there is a lot of research that could go on.
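
[Editorial illustration, not part of the spoken remarks.] A rough sense of why a cluster sample "runs out of steam" comes from the Kish design-effect approximation, deff = 1 + (m - 1) * rho, where m is the number of students sampled per cluster and rho is the intraclass correlation. The sketch below assumes equal cluster sizes, and the sample size, cluster size and correlation are hypothetical values chosen only for illustration.

```python
def design_effect(cluster_size, icc):
    """Kish approximation: variance inflation relative to a simple random sample."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Size of the simple random sample that would give the same precision."""
    return n / design_effect(cluster_size, icc)


# Hypothetical: 2,500 students, 25 per school, intraclass correlation 0.2.
deff = design_effect(cluster_size=25, icc=0.2)
n_eff = effective_sample_size(n=2500, cluster_size=25, icc=0.2)
print(f"design effect = {deff:.1f}, effective sample size = {n_eff:.0f}")
```

Under these assumed values the 2,500 sampled students carry roughly the information of 430 independently sampled ones, which is why subgroup and multi-variable analyses lose precision quickly.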

On the other hand, in a lot of ways, it has collected too much data. It is one of these sort of committee-designed data sets from teachers and from students where, you know, you collect data on every hypothesis imaginable. It is an example of, I think, sort of bad social science, in a lot of ways, where the theories aren’t strong enough to really dictate what you should collect, so you tend to collect everything. The National Assessment Governing Board is currently trying to address some of that issue, maybe over-addressing it in some ways, by sort of trying to streamline the background items connected with NAEP.

I have a paper on this that I wrote for NAGB, which goes into a lot more detail and which I will be glad to share if anybody is interested.

Let me now turn to this other study, because I have heard a lot of the same topics and issues brought up today, and I thought it would be useful to try to sort of tell you that there is another study going on which has tried to address some of the same issues, and some in a different way than we are doing here.

I heard and read the paper by David Cohen, which I consider to be absolutely, you know, sort of just a revolution in the field, and all of the words I heard in the paper itself about theory building, about the causative links that David and Steve are working on, are just wonderful.

I think three years ago I wrote a paper that said, you know, the assumption then in this field is that if we could only get better data or better statistics, we would find what some have called, I think, the gold standard; that is, we’ll get a measurement so good that we would all believe it, and that’s where consensus would come from. I think we have sort of given up on that now; that is, it is not really a matter of more data and sort of better techniques. It is really much more a matter of whether we can get replicable and consistent measurements across social science in general, and education, as a basis for forming theories.

I come out of a background of physics originally, and I have used a lot of that physics recently in terms of thinking about these problems. Until you can get consistent measurements, you cannot really build the broader theories. On one of the questions that Ellen asked this morning about, you know, this being a daunting task: it is the role of theory to cut down the amount of data, the amount of research you need to do. Without theory, you collect everything. With theory, you can really design very specific next experiments which test the hypothesis you want to test. So the lack of theory in this area is a really serious problem, but theory can’t be built until you have confidence in your measurements.

The field, basically, is oriented toward making more measurements. We get new databases, and we make new measurements, and we get better measurements and we develop new statistical techniques, so we try those. So we have much research, but little knowledge, basically, and I think the cause of that is that until we get serious about the business of why research results differ we can’t make progress.

Now, research results can differ for a whole number of reasons, which include different model specifications, different assumptions, a whole range of reasons down to a researcher who just made a mistake in SAS coding, but until we get serious about doing research that looks at how consistent measurements are and why they differ, we can’t make progress. I think that is the starting point of sort of real investigation in the field.

I come originally from physics. You know, when you get inconsistent measurements, it creates a crisis. You don’t go on and continue to measure until you sort of solve why there’s inconsistency, and uncovering that inconsistency becomes a major turning point for a field.

So I think the key here is that we need more research that looks at the consistency of measurements and why they are inconsistent, rather than another measurement, which sort of tries to measure the same thing again.

The thing that we have been looking at in this study is whether the problems that we see in education research are seen sort of across much of the field, and the answer is yes. Our job was to see if there are some strategic issues which sort of can explain this wider phenomenon; that is, is there a set of flawed assumptions which we are all making in our analysis which might explain this kind of inconsistency? One of the nice things about this panel is that it contains people doing brain research and genetic research all the way over to sort of the social sciences, and, therefore, you have to read the stuff that is going on over there. I think, to summarize, the main thing that we are sort of on to, without stating it as a - you know - finding, is that the things we are learning in the area of genetics and brain research really are inconsistent with the assumptions we are making in the broader area of social science, basically, and until we sort of get those two on the same set of assumptions, we are probably always going to have inconsistency of measurements; that is, it is really a set of fundamental assumptions that run across most of the work that all of us do, and, in particular - go ahead.

If you look at the work that is being done in areas of genetics, in areas of brain research, in areas of behavioral genetics, for instance, there is just absolutely compelling evidence that biology affects nearly - a significant part - a “significant part,” I mean a part that - enough to make other measurements biased - affects a significant part of the - whether it is characteristics, behavior, outcomes for children.

Now, this has nothing to do with race, which has sort of tarred the whole thing. This is simply to do with intra-race sort of differences, but until you can explain why twins that are identical have a correlation of .9 on most measures and fraternal twins have .5, and less related have .25, you simply have to assume that there is a biological mechanism that sort of creates differences among kids. Amazingly, what I have found is that when you talk to parents or grandparents and you ask them, you know, “Do your kids have inborn differences from the time they are born?” the answer always universally is yes; that is, parents and clinicians understand this much better than researchers do, and research has been slow in coming to this. But the more problematical and complex thing is that most of the influence, whether it is biology or environment, seems to occur interactively; that is, not independently, and that is a really significant research problem.
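
[Editorial illustration, not part of the spoken remarks.] One classical way to read twin correlations like the .9 and .5 quoted above is Falconer's formula, which splits variance into a heritable share, a shared-environment share and a remainder. It is offered here only as a sketch of the arithmetic, not as the panel's method, and it assumes purely additive, independent genetic and environmental influences, which is exactly the assumption the remarks go on to question.

```python
def falconer_decomposition(r_identical, r_fraternal):
    """Falconer's approximation from identical and fraternal twin correlations.

    Assumes additive genetic effects and no gene-environment interaction.
    """
    heritability = 2 * (r_identical - r_fraternal)   # h^2
    shared_env = 2 * r_fraternal - r_identical       # c^2
    nonshared_env = 1 - r_identical                  # e^2 (includes measurement error)
    return heritability, shared_env, nonshared_env


# Correlations quoted in the remarks: .9 for identical twins, .5 for fraternal twins.
h2, c2, e2 = falconer_decomposition(0.9, 0.5)
print(f"h^2 = {h2:.2f}, c^2 = {c2:.2f}, e^2 = {e2:.2f}")
```

If influences are interactive rather than additive, as the remarks argue, these three shares are no longer cleanly separable and the numbers should be read only as a rough first pass.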

Let me give you the mechanisms, the ways in which this happens. First of all, brain research basically tells us that a lot of the brain capability that is sort of genetically programmed doesn’t come on line unless there is an environmental stimulus there to make it happen, so that it is the interaction of these two that allows something to develop. But the other parts of it are that our sort of biological characteristics actually help to determine what we choose in terms of environments. If you are a better reader, you might more likely go to a book store than if you’re not a better reader. If you have attention deficit, you are more likely to be in risky behavior than if you do not. So biological tendencies help determine the environment that we get in, one.

Two, the way people react to us is partly due to our biology, and if you are a parent, you understand that as well. You don’t parent your kids in the same way, usually, because you are aware of their differences.

And I think the third thing is that even if kids are in the same environment, some of their inborn characteristics make them react differently to the same environment. This is a hypothesis at this point, but the research from brain research and genetics and that area really paints a quite compelling picture that this is the way the outcomes that we are all trying to study happen, and if that is true, it really sort of, I think, has some major implications for the way in which we do research.

I think that covered the first one.

In terms of data collection, one of the implications is that we have to include samples of related individuals in most of our samples, which we have not done; that is, unless we include twins, unless we include siblings, we can’t sort of have the genetic controls we need to disentangle these effects, and so one of the possible implications is that we are going to have to - you know - much more widely stratify on measures that we currently have never stratified on in terms of including individuals in most of our major surveys.

A second kind of thing that comes out is that, as you know, even environmental interaction effects are tough to measure. Interaction terms take bigger samples. They are tougher to measure, and things appear to be interactive even in the environmental area, let alone the biological area, and so I think it means that we have to investigate sort of a new range of phenomena. I mean, the good news is that some of this can explain inconsistent measurements, I think, quite widely. The bad news is it is much tougher research to try to do and more demanding of data and researchers.
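
[Editorial illustration, not part of the spoken remarks.] The claim that interaction terms take bigger samples can be sketched with the textbook case of a balanced two-by-two design with n independent observations per cell and a common outcome variance. The main effect is a difference of two marginal means; the interaction is a difference of differences across all four cells. The function names and numbers below are hypothetical.

```python
def main_effect_variance(sigma2, n_per_cell):
    """Variance of a main effect estimated as a difference of marginal means.

    Each marginal mean pools two cells (2n observations), so its variance is
    sigma2 / (2n); the difference of two such means has variance sigma2 / n.
    """
    return 2 * sigma2 / (2 * n_per_cell)


def interaction_variance(sigma2, n_per_cell):
    """Variance of the interaction (y11 - y10) - (y01 - y00).

    Four independent cell means, each with variance sigma2 / n, are combined.
    """
    return 4 * sigma2 / n_per_cell


sigma2, n = 1.0, 100
ratio = interaction_variance(sigma2, n) / main_effect_variance(sigma2, n)
print(f"interaction variance is {ratio:.0f}x the main-effect variance")
```

Under these assumptions, estimating an interaction as precisely as a main effect needs about four times as many observations, and more still if the interaction is smaller than the main effect.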

Finally, experiments are being seen as a good thing, maybe too good a thing these days, but the implications in terms of the effects we measure - you know, most of the things we measure turn out to have what most of us consider to be smaller, disappointing effects; that is, we never find a sort of silver-bullet kind of - kind of effect. You know, the most recent one was after-school - you know - programs.

If we really believe that most of this stuff is interactive, we should always expect small effects from most of our interventions, small, average effects, and the reason is that kids are so different biologically that any one-form-fits-all solution is going to create a whole range of effects from zero or maybe even negative on up. So average effects of experiments can be quite misleading. The interventions can be quite effective for a sub-group of kids, and so it is the differential effects which are really important, and that also means a different way of experimentation. It means that experimentation should go on continuously, where you continually improve the design and test the next phase on the next set of kids and so forth, so that experimentation is not a one-shot, let’s-measure-an-average-effect. I think that’s always going to be small, if this evidence is right. It really demands a different kind of research center, a different kind of funding for things that allow more continuity over time in the design.
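
[Editorial illustration, not part of the spoken remarks.] The argument that a small average effect can hide a large effect for a sub-group is simple weighted-average arithmetic. The sub-groups, shares and effect sizes below are invented purely to show the calculation; they are not results from any study.

```python
# (label, share of students, effect in standard-deviation units) - all hypothetical
subgroups = [
    ("responds strongly", 0.20, 0.50),
    ("no response",       0.70, 0.00),
    ("slightly negative", 0.10, -0.10),
]


def average_effect(groups):
    """Population-average effect as the share-weighted mean of sub-group effects."""
    total_share = sum(share for _, share, _ in groups)
    return sum(share * effect for _, share, effect in groups) / total_share


avg = average_effect(subgroups)
largest = max(effect for _, _, effect in subgroups)
print(f"average effect = {avg:.2f} SD, largest sub-group effect = {largest:.2f} SD")
```

An experiment that reports only the average would show a small effect here, even though one fifth of the hypothetical students benefit substantially, which is the case for looking at differential effects.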

The other thing that I think is called for is that most of our experiments are program evaluation, rather than testing theory. If we can get theories, we don’t need program evaluations, because we can predict the effects of programs. So more experiments should be done that allow us to test theories, rather than simply measure programs.

Program effectiveness measures have very limited applicability, because if you change their context a little, the results change, or if you try to implement the program more widely, you know, the results change. So it is a very fragile measurement, and we need more robust measurements along this line.

Finally, one of the interesting things here is that, in this field, and all over, there is this huge gap between clinicians and researchers. We don’t talk to each other. We don’t respect each other, all kinds of things, and I think there is a reason for that. The reason, I think, is that clinicians have always implicitly been facing this more complex environment; that is, they have to design individual interventions for kids, given what they know about the kids and all of their characteristics, and that is a tough job. We have been so far away from it that we have taken the easy road of not having to deal with all that complexity, and so I think one of the possible implications of this is that we need much more clinician input into both design and evaluation of experiments, because how do you know why this experiment works for this set of kids and not this set? I would vouch that not many researchers could answer that question, but I would vouch that a lot of clinicians could give you a lot of good hypotheses about why that might be true.

Now, I don’t want to push these. We are at the point of sort of suggestive things that might explain this pattern of results across the field. We have got a lot more study to do, but I thought it would be useful to sort of at least share some of the thinking that is going on in this context and say that, one, educational research is not alone, and, two, I think the whole field needs to pay more attention to what we are learning over here in the area of brain research, genetics, behavioral genetics, because that is where our basic assumptions come from when we do our analysis.

Thank you.
