Workshop on Understanding and Promoting Knowledge Accumulation in Education:
Tools and Strategies for Education Research
Day 1 – June 30, 2003
Remarks by Dr. Marilyn Seastrom
DR. MARILYN SEASTROM: Good afternoon. Basically, what I am going to talk to you about is NCES. We are the statistical arm within the Department of Education, and, as such, we are congressionally mandated to collect, collate and produce published data on education in the United States and other countries. That is a big order.
What that translates into in reality is we collect data from pre-K, actually, now, from the birth cohort of the Early Childhood Longitudinal Study, the first year of life, to adult education, where we are doing things like measuring participation in adult education, and we are also measuring literacy of elderly adults in the country, in our national and international literacy studies, and we try to do everything in between as well.
To do this, we have all types of data, ranging from universe census-type data to cross-sectional surveys, longitudinal surveys and assessments. Many people today have used examples of some of those.
One point I want to make, though, is what makes us different as a federal statistical agency collecting data is we are not like researchers to the extent that if you get a grant to do a particular project, you design your data collection to collect the data you need to test the hypothesis that you are getting funded for.
We are using the taxpayers’ dollars, so we don’t have the luxury or the constraint - depending upon your perspective - of collecting data that are quite that narrow. When we design a survey, of course, we have to have a reason, a rationale for it, and one of the things we do in the process of defining that is we talk to people like yourselves, our constituency, policymakers, to find out what their aims are, but we have to define it broadly enough, so that we really maximize the use of the taxpayers’ dollars, so that we are collecting information on specific hypotheses, but we are also collecting information on a breadth of topics related to that.
I think probably the best example - I know that David talked about how broad some of the data in the National Assessment are, but I think probably a better example of that is our Schools and Staffing Survey, which serves as a multi-purpose survey of the status of elementary and secondary school education in this country.
That study basically does everything except assess students. We go out in the field. We go to schools. We sample schools. We get rosters of teachers. We sample the teachers. We sample their principals. We get information from the district. We get information from the library and media centers in those schools. There is no one hypothesis that drives that study, but there is a wealth of information, including things like instructional practices, where we have been doing research for a number of years on ways to try to combine some more qualitative learnings to improve those measurements that we have within classrooms, so that we can try to go beyond simply asking a teacher how much time he spends doing X, Y or Z.
Okay. And, frequently, the information that we collect in those surveys is guided by technical review panels, and I know there are many people in the room who have served on some of those panels for us. There is at least one person here who has served on one of our advisory - our main advisory committee. So we don’t do these things in a vacuum. We seek input and we get input and advice from a lot of people.
Essential to this notion of shared common data is the point that we take as our responsibility as a federal statistical agency to provide high-quality data with descriptions and documentation that are needed for reproducibility.
I think this is an important distinction from this notion of replication we have talked about today, and it is something I would like to add to the table: when you want to give people the information so they can replicate, what you really want is for them to be able to reproduce. So, on one hand, you want to be able to have someone take the exact same data set and the exact same code, check the code, of course, to make sure that it is doing what it says it is going to do, and get the same results. Compare that to the replication Barbara talked about, where you take a different data set and the same hypothesis and you’re looking to see whether you get the same results with a different data set or the same data set at a different point in time. So I think maybe that distinction will help crystallize some of the conversations.
We have had written standards at NCES since 1992 that make this notion of reproducibility and clear documentation a central part of our existence and of what we do.
Something that hasn’t been mentioned today, but I think it is worth mentioning, is that as of September 1 or 30th of 2002, there is a federal-government-wide requirement that all information put out by the federal government is subject to information quality guidelines. They are required and administered by the Office of Management and Budget. Basically, those guidelines require utility, quality and integrity, and included in there is a notion of reproducibility.
So whether or not you know it, it is in the law and every federal agency has somewhere on its website their own description of their information quality guidelines, and there is a procedure in place, so that if people feel they are not able to get access to the information for reproducibility or they disagree with the finding and think there is an error, there is a process within each department of government where you can go to the department and protest and ask for correction of information. Just so that you know that. It is something that was put in place by the government to help improve the quality and access to data.
Another of the topics that we were asked to address today concerns the balance between protecting confidentiality, while, at the same time, maximizing data access. We spend a lot of time on this at NCES. We are impacted legally by both privacy and confidentiality laws.
On the side of the privacy laws, we have the Privacy Act of ‘74, and the Protection of Pupil Rights Amendment, which we refer to as PPRA. We don’t know whether it’s pupil protection or protection of pupil.
And then, on the other side, we have confidentiality laws. We have what is now known as the Education Sciences Reform Act of 2002, and then we have a new law that basically takes the provisions that we at NCES have had since ‘88 in our law and applies most of them to all federal information that is collected as statistical data. It is part of the E-Government Act: Confidential Information Protection, Title V.
And what do these laws do? If we look first at the privacy side, the Privacy Act safeguards the individual’s personal privacy by requiring that individually-identifiable data be current, accurate and protected from misuse.
In education, under the PPRA, there are additional safeguards that require parental consent and limit the topics that students can be required - and I stress “required” - to respond to in surveys.
This is the list of topics that are specified in law. Under the law, no student shall be required to submit to a survey, analysis or evaluation that reveals information concerning this list of topics without prior written consent of the parent, and then you see the list there. It is a rather extensive list and was added to in the last administration. I think the political piece got added.
One of the issues that is under consideration in the department right now is what does “required” mean? Up until this point, the office that administers PPRA has been doing it on an ad-hoc basis, and they have reached a determination, on an ad-hoc basis, that a test that is administered in a classroom, by definition, cannot be considered voluntary. That puts us in a bind. We are also required by law to collect data that are on this list of forbidden topics. We are required to collect data on school safety, on crime safety, illegal behaviors in school. So there is a committee that has been convened. It hasn’t started meeting, but the intent of that committee is to work out these relationships and to come up with a working definition of “voluntary” versus “required” that will allow us to protect people’s rights, but, at the same time, collect the information in a voluntary environment that we need to monitor education.
On the other side, we have these confidentiality laws. What these laws do is they require federal agencies to protect, again, individually identifiable data that are collected under a pledge of statistical confidentiality. If you collect your data - and, now, under the E-Government Act, it doesn’t matter whether you are a statistical agency or not, if you collect your data for statistical purposes and you promise statistical confidentiality and you collect them and you’re part of a federal agency, then you are bound by this particular law. In addition to requiring protection, it restricts access to these data to authorized persons and to statistical uses, and, under this law, a willful disclosure of someone’s identity is a Class E felony, and what that is, for those of you who haven’t had to deal with Class E felonies, is a $250,000 fine and up to a five-year term in prison. It gives a lot more meat to most agencies that have just been covered by the Privacy Act, under which a violation is a misdemeanor.
Okay. Because of these constraints now imposed by these laws, we have both public-use and restricted-use data files. Our public-use data files include anonymized versions of individually-identifiable data, while the restricted files include the more detailed, individually-identifiable data that we use for our in-house analysis. The direct identifiers for persons are removed from any and all analysis files. So we are not sitting there with files in our office that have people’s names, Social Security numbers, addresses, things like that, nor do we give those to people through our licensing system.
Our public use files are available to everyone. Many of them are on our website.
Our restricted-use files are available to qualified researchers for approved statistical purposes, subject to the terms of the license agreement, and I know there are several people in the room today who are licensees, and that is a whole different talk to go through what that whole process is.
What I want to talk about next is the ways that we provide access to our data.
What you see here is a list of the different steps that we take to provide data, and I’m just going to real briefly go through them and let you know what they are, because for those of you who are our heavy users and our licensees you may not know about some of these other tools that are there that are useful for graduate students for getting basic data.
The first group - next slide. This group of tools - you see a listing here, and they are basically a descriptive set - provides descriptive data from our universe data collections and allows people to go on line and look up a school, get the basic data, type in another school for comparison or say I want to compare it with other schools that are similar in characteristic X and then get information in that way.
Our next set of tools - and this is sort of hierarchical in the amount of information that they allow the researcher or user to have.
Our next level includes tools that allow the researcher or the external user to build basic tables, but not a lot of data manipulation is involved here. We have all these tools in the data. You can build a table, and actually it’s - I don’t play with it a lot, but I looked at it before I came here to talk about it, and - some extra steps. There actually is a lot of manipulation you can do on the underlying safety(?) of that tool.
A tool that perhaps more of you are aware of is our main data tool. This tool is constructed basically with a set of pre-run tables behind it that allow people to call those tables up and cut them in different directions, getting student scores by a bunch of different variables - basically, student scores or students’ achievement levels by virtually anything on the data file.
You can get standard errors and you can get significance-test results, and they just added a graphic capacity to that one.
Okay. I mentioned we have public-use files. These are for people who want to get in there and feel the data and manipulate micro-data themselves. We currently have over 100 public-use micro-data files with data from primarily the late ‘80s through the present time available on our website, and then our earlier data from the late ‘60s on are supported by NCES at the International Archive of Education Data at ICPSR, which you heard about.
We had a number of data sets from the ‘60s and ‘70s that came from the era when they had cards or only had nine-track tapes that weren’t real well documented, and so we engaged in a large project with ICPSR to take these data, read them in, clean them up and document them, to make them available to people in an electronic format and make them more accessible.
As many of you who use our data know, the information in the public-use file is not always sufficient. In some cases, we have decided that the changes that are required to produce a public-use file would make the data useless, so we only can do a restricted-use file.
Okay. So there are two options that exist here. One is what we call a “Data-Analysis System,” and the other is restricted-use data.
I want to talk a little bit about the Data-Analysis System, because that is something - I haven’t heard a lot today about post-secondary education, and most of these systems are in that field, so it is something you may not be familiar with.
Okay. With this, what happens is they are based on analysis reports that we have put out, and all the background variables and all the analysis variables in a specific report are available on the Data-Analysis System. So the data user reads a particular report. They are interested in seeing relationships with additional background variables that weren’t included in the report. They can go to the Data-Analysis System. They can explore those relationships. They can get tabulated data with standard errors that are computed correctly for the complex samples, and they can also get regression coefficients.
Okay. And if that is not good enough for you or if you are working on a data set that doesn’t have a data analysis system, we also have restricted-use data.
What this does is provide qualified researchers with full access to the data that we use internally and that our contractors use. They can apply and be loaned restricted-use data to use at their institutions.
And this is an aside, but I will note that this affects a number of federal agencies, particularly under the new law that I referred to. One of the things it does for other agencies is it gives them the legal capacity to share data as well. Some of them were precluded from doing that under the Privacy Act, as they didn’t have stiff enough penalties.
Many of the agencies and other major statistical agencies are doing that by using either data centers, where researchers are required to travel to a centralized location and go through the same sort of application process, a justification for why you need the data. Then you travel there and you pay the fee to use the data. Or, alternatively, you do remote access, where you go through the same justification. You get a dummy data set to develop your programs on. You send your programs in, and then they are run for you. You pay for that, of course. They review the output to make sure they are not giving you any confidential data, and they send that to you.
Okay. So I just - for those of you who use our data, I just wanted you to know that you really are - you’re getting a lot more access than a lot of people are getting with federal statistical data.
Okay. My last point here is to answer the last two questions - “Why is data access important to us at NCES?”
It is a congressional mandate, so we have to do it; that comes first. But, more importantly than that, actually, federal education databases that are available to all researchers really can help ensure increased access to a common information base.
And, second - analyses of those data, I think, really will provide the means, as we move forward, for people having access to the same data and with the same definitions.
Thank you.