Workshop on Understanding and Promoting Knowledge Accumulation in Education:
Tools and Strategies for Education Research
Day 1 – June 30, 2003
Remarks by Dr Norman Bradburn
DR. NORMAN BRADBURN: It’s always the enviable position, you know, speaking last.
I have a few slides that I’ll use later, because by then it will really be near the end of the day and you’re going to need all the help you can get.
I am just going to talk about confidentiality, and Marilyn has already mentioned some of the things, but I will perhaps elaborate on some of them.
Let me start off with what I see as the main issues of confidentiality, and I think there are three. One is who has access to the data. The second is what are the threats to confidentiality: why are we worried about confidentiality? And the third is what are the techniques for protecting confidentiality?
So I need to start with the first, who has access to the data? What do we mean when we say we are holding data confidentially?
Well, I think we mean that there is a restricted set of people who have access to it, and those people have, in some way or other, promised to obey the confidentiality rule. Typically, in an organization, that is people who are employed by the organization and have signed confidentiality agreements of some sort, mostly explicitly now, or at least implicitly.
And in big organizations, that is relatively easy. In a university setting it’ll be the research team - the PI and his researchers, research assistants and so forth - but then my question is, well, how can what I call the “umbrella of confidentiality” be extended? And that is one of the real nitty-gritty questions, because the question is, in sharing data, what are we doing?
Well, one of the things we are doing when we share data with others who weren’t part of the primary research team is to extend the umbrella of confidentiality to this group, and then you have to ask, “Well, how do you do that and retain some kind of feeling that the promises that you have made or the responsibilities you have about confidentiality are being fulfilled by this other group over which you have virtually no control?”
And, ultimately, it is a matter of trust. You can do some things to bolster it, but, ultimately, you really trust the people you give the data to to maintain the same level of confidentiality as you do, or even a stricter one, perhaps.
Now, one of the reasons that is a problem is that people might want your data because they want to add some more data of their own to it, which now makes a different data set with different confidentiality problems. Say somebody has access to some geographic information. Well, adding geographic information to a person file makes preserving confidentiality much, much, much more difficult. So there is this question of not only how widely you’re sharing it, but what they are going to do with it, in terms of adding extra data, which may mean that you have to do other things to preserve confidentiality. So that is a big issue.
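The effect of adding geographic information can be illustrated with a minimal sketch. It assumes a made-up person file and hypothetical field names (`age_band`, `sex`, `county`), and it measures risk by the size of the smallest group of records that share the same values, which is one common yardstick and not something named in the remarks.

```python
from collections import Counter

def smallest_group(records, fields):
    """Return the size of the smallest group of records that share
    the same values on the given fields. A size of 1 means at least
    one person is unique on those fields."""
    groups = Counter(tuple(rec[f] for f in fields) for rec in records)
    return min(groups.values())

# Made-up records for illustration only.
people = [
    {"age_band": "30-39", "sex": "F", "county": "A"},
    {"age_band": "30-39", "sex": "F", "county": "B"},
    {"age_band": "30-39", "sex": "M", "county": "A"},
    {"age_band": "30-39", "sex": "M", "county": "A"},
]

before = smallest_group(people, ["age_band", "sex"])
after = smallest_group(people, ["age_band", "sex", "county"])
print(before, after)
```

In this toy file everyone shares their age band and sex with someone else, but once county is merged in, some records become unique.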
And then there are public-use files, which are essentially files that are out there in which there is something built into them that you feel protects the confidentiality, because they are going to be out there for anybody to use, and at one level you are saying that the public hasn’t signed any confidentiality agreements. So whatever the threat to confidentiality, it has to be taken care of in the preparation of the public-use file, and, as Marilyn just mentioned, sometimes what you have to do to the data to get it to that status actually makes the data virtually worthless. So for some data it doesn’t make sense to have public-use files, because you can’t mask the data. So those are, I think, the kinds of issues about who has access.
Now, what are the threats to confidentiality? What do we have to look out for?
Well, actually, I think the serious threats to confidentiality are the ones that rarely get talked about in meetings like this. Most of the conferences and workshops on confidentiality that I have been in are made up of researchers, and they are all good, honest people, and they know they would never, on purpose anyway, reveal anything. So they are people who worry about how one might inadvertently breach confidentiality, and a lot of the academic issues have to do with protecting against even inadvertent disclosure. But, in fact, I think the major real threats to confidentiality are in a completely different realm.
First of all, there is law enforcement. I don’t know what the status of the NCS one is now, but I think the Patriot Act said that the government could get access to names and data of individuals, even in educational statistics of various sorts.
Basically, the only two places I know of that are virtually - and maybe actually - immune from law enforcement getting access to data are the Census Bureau, which is protected by Title 13 and seems to have a very good record of actually making good on that, and certain data that are collected under a PHS - I always forget the title, 13-B or something like that - which has what is called a “shield law” that protects the investigators from having that data subpoenaed. Other than that, none of us, that I know of, have any legal protection.
Now, mostly a judge can issue a subpoena and the data can actually be subpoenaed for various reasons. You might say, “Well, why would they want to do that?” Well, where there are law-enforcement issues - that is, where people are investigating, say, Social Security fraud or welfare fraud, and we have done a study of welfare recipients, and so the government investigators want access to the data to find out about welfare fraud. When I have had to confront those issues, which I have done three or four times in my career, most of the time we have gotten the enforcement people to back off on the grounds that it is bad public policy; that is, they are going to subvert the very purpose of doing the study in the first place, which was maybe to help estimate, for example, how much income-tax cheating there is or how much welfare fraud there is, that sort of thing. So in every case that I have confronted, we have finally, though it took a lot of effort at times, gotten the law-enforcement people to back off.
A more difficult issue, actually, has to do with private suits or class-action suits. One of the cases that I was involved in, wearing a different hat, and did lose, actually, was a class-action suit brought by people who had been treated with a particular drug in an experimental paradigm in a study some years ago, and it happened that this study was one of the few places where there were good records about who had actually taken the drug and had been followed up, and so forth.
Consequently, the plaintiffs in the class-action suit subpoenaed the research records, which had the individual data on treatment and the health follow-ups and so forth for the individuals. That was a case where the promise of confidentiality simply couldn’t be kept. We could have gone to jail, I guess - luckily, it wasn’t my study, but the PI could have gone to jail. So far I only know of one case where a guy has gone to jail in order to protect the confidentiality of his sources, but most of us are not that - you know, we’ll do a lot, but -
Another one that is possibly increasing is Freedom of Information Act issues. This is part of the regulation - I always get these numbers wrong - that OMB put out for data sets that are important in making public policy, particularly regulation. You can get the data under the Freedom of Information Act if it is a grant. Contracts have always had a problem, but prior to this regulation grants were protected from that; now, even if you have a grant, they are not.
Okay. The other problem is computer matching.
So those are the kinds of things that one needs to think about with regard to confidentiality, and although they don’t come up very often, when they do come up, I think investigators are often unprepared for how to deal with them.
I’m going to turn now to techniques for protecting confidentiality. There are two big strategies. One is to restrict the users, and the other is to alter the data. Those are the major kinds of things.
So let’s look at the next slide.
The restriction of use comes in what I call a strong form and a weak form.
The strong form is data enclaves. The NSF supports a number of research data centers around the country, mostly at universities or sometimes at a Federal Reserve Bank, but with university participation, in which micro-Census data - that is, the actual Census data, protected by Title 13 - is available to researchers who have demonstrated, by submitting a research proposal, that the research questions they will pursue will be a benefit to the government. They can get actual access to the micro-data in the Census data, and, more importantly, or at least what worries(?) users, they can add other data. So they get the Census micro-data and they can add data from the employer files. They can add Social Security data. They can add tax data. They can merge files, because they have access to the individual names, so they can get other data. This permits all kinds of research which would be absolutely impossible otherwise. There are very, very, very strict controls over it: there are sealed rooms, and nothing is on the network. You can’t do anything remotely and so forth. So it is cumbersome in various ways. Some other statistical agencies have talked about this kind of thing. I think personally that that is probably the way that we are going, because of some things about sharing data that have become problems.
The weak form - weak in the sense that it is not as strong as this other one - is the licensing procedures that NCS uses, which, personally, I think are the best model for handling this issue. So far, my sense is that because of Title 13 the Census Bureau can’t do it, but I am hoping other agencies will. This is what NSF does for the science statistics that we are responsible for.
Okay. The other alternative is altering the data. With the weak form of altering the data, there are several different ways. One is just top coding. That is just collapsing the top codes, where you have incomes, say, and where you have too few people, essentially, to protect confidentiality, or collapsing categories altogether so that you have a minimum size for analyses. These are the things that are often done in preparing public-use tapes.
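A minimal sketch of top coding and category collapsing might look like the following. The function names, the income cap, the minimum group size, and the `"OTHER"` label are all illustrative assumptions, not values used by any agency.

```python
from collections import Counter

def top_code(values, cap):
    """Replace every value above the cap with the cap itself,
    so very high incomes no longer stand out."""
    return [min(v, cap) for v in values]

def collapse_small_categories(labels, min_size, other="OTHER"):
    """Fold any category with fewer than min_size members into one
    catch-all category. The catch-all itself may still be small,
    so a real release would need to check it as well."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_size else other for lab in labels]

# Made-up values for illustration only.
incomes = [28_000, 41_500, 67_000, 250_000, 1_200_000]
occupations = ["teacher", "teacher", "nurse", "nurse", "astronaut"]

print(top_code(incomes, 150_000))
print(collapse_small_categories(occupations, min_size=2))
```

The cost is visible in the output: the two highest incomes become indistinguishable, which is exactly the loss of information that can make heavily masked data less useful.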
Another one which is used to protect confidentiality is dividing the files, so that the identification data is held in a separate file from the individual data and so forth. And if you want to protect it against subpoena, this is used with the main file kept in Canada, where it is not subpoenable, or in some other country where it is not subpoenable. So there is somebody there, an archive group or something, in Canada, which is the only place that has the numbers that will be able to put it together. That is used only, I think, for highly sensitive data that you wouldn’t want even the courts to get.
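Dividing the files can be sketched as follows, assuming a hypothetical record layout and a random linking key. The field names and the made-up records are illustrative only; the point is that the analysis file carries no identifiers, and only whoever holds the identifier file can put the two back together.

```python
import secrets

def split_records(records, id_fields):
    """Split each record into an identifier part and a data part,
    joined only by a random key that carries no meaning."""
    id_file = {}
    data_file = []
    for rec in records:
        key = secrets.token_hex(8)  # random linking key
        id_file[key] = {f: rec[f] for f in id_fields}
        data = {f: v for f, v in rec.items() if f not in id_fields}
        data_file.append({"link_key": key, **data})
    return id_file, data_file

def relink(id_file, data_file):
    """Rejoin the two files. Only possible for the holder of id_file."""
    return [
        {**id_file[row["link_key"]],
         **{k: v for k, v in row.items() if k != "link_key"}}
        for row in data_file
    ]

# Made-up records for illustration only.
records = [
    {"name": "Ann Example", "ssn": "000-00-0001", "score": 71},
    {"name": "Bob Example", "ssn": "000-00-0002", "score": 64},
]
id_file, data_file = split_records(records, ["name", "ssn"])
print(data_file)
```

In the arrangement described above, the file needed for relinking would be the one held by the archive group outside the reach of a subpoena.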
Finally, the strong form. There are two versions of the strong form. One is just adding random noise to the data, which just makes the data more unreliable, because you have just sprinkled, just changed the values that are present, which won’t have any effect on the means, but will affect the reliability of lots of things, and, obviously, people worry about that, because it reduces the - you know - defined(?) effects.
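The trade-off with random noise can be seen in a small simulation. This is a sketch under stated assumptions: the scores are simulated, the noise is zero-mean Gaussian, and the standard deviations are arbitrary choices. With zero-mean noise the mean is unchanged on average, though any single sample will shift slightly, while the variance is inflated.

```python
import random
import statistics

def add_noise(values, sd, seed=0):
    """Add independent zero-mean Gaussian noise to every value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sd) for v in values]

# Simulated scores for illustration only.
rng = random.Random(1)
scores = [rng.gauss(100.0, 10.0) for _ in range(10_000)]
noisy = add_noise(scores, sd=5.0)

# The means stay close; the variance of the noisy data is larger.
print(round(statistics.fmean(scores), 2), round(statistics.fmean(noisy), 2))
print(round(statistics.pvariance(scores), 1), round(statistics.pvariance(noisy), 1))
```

The extra variance is what makes individual values less reliable and weakens relationships estimated from the noisy file.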
The other one, which is getting a lot of attention recently at the research data centers, and which Marilyn mentioned briefly, is actually creating synthetic data sets, which have all of the statistical properties of the original data set but have entirely false data - made-up data - so that you cannot break confidentiality because, in fact, any data record you have is a synthetic data record.
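A toy version of the idea, assuming a deliberately simple model, is to fit a model to the real records and then release only records drawn from the fitted model. The sketch below uses a two-variable linear model with Gaussian noise; this is an assumption made for illustration, not the method used at the research data centers, and it preserves only the properties the model captures.

```python
import random
import statistics

def fit_model(xs, ys):
    """Fit y = intercept + slope * x by least squares and record
    the spread of x and of the residuals."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return {"mx": mx, "sx": statistics.pstdev(xs), "slope": slope,
            "intercept": intercept, "s_resid": statistics.pstdev(resid)}

def synthesize(model, n, seed=0):
    """Draw n made-up (x, y) records from the fitted model.
    No released record corresponds to a real person."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x = rng.gauss(model["mx"], model["sx"])
        y = model["intercept"] + model["slope"] * x + rng.gauss(0.0, model["s_resid"])
        out.append((x, y))
    return out

# Simulated "real" data for illustration only.
rng = random.Random(42)
real_x = [rng.gauss(50.0, 10.0) for _ in range(2_000)]
real_y = [3.0 + 2.0 * x + rng.gauss(0.0, 5.0) for x in real_x]

model = fit_model(real_x, real_y)
synthetic = synthesize(model, 5_000)
refit = fit_model([x for x, _ in synthetic], [y for _, y in synthetic])
print(round(model["slope"], 2), round(refit["slope"], 2))
```

An analysis rerun on the synthetic records recovers roughly the same slope as the real data, but any relationship the model did not include would be lost, which is why building good synthetic data is a hard statistical problem.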
Now, you might say, how is that possible? This is what gets statisticians excited, how to make it possible, and I think people are more and more coming to believe that it is, in fact, possible. What happens is that people can send in their model, and the center can make up the synthetic data. You can go back, you can run things, sharpen up your hypotheses and so forth, and then, after you’ve got everything right and get your SAS code right, you send it in and they will run it on the real data, and they’ll send you back the results.
I think it is possibly the way of the future for lots of very, very confidential data. And because the ability to protect confidentiality is, I think, being eroded by the internet and other possibilities, I think this is probably where we are going to be driven to, although I hope not.