DRAFT

Measurement Error in Surveys of the Low Income Population

Nancy A. Mathiowetz
Joint Program in Survey Methodology

November, 1999

Paper prepared for
The Workshop on Data Collection for Low Income and Welfare Populations

National Academy of Sciences

Introduction

The measurement of the low income and welfare populations offers particular challenges with respect to reducing various sources of response error. For many of the substantive areas of interest, the behavioral experience of the welfare populations is complex, unstable, and highly variable over time. As the behavioral experience of respondents increases in complexity, so do the cognitive demands of a survey interview. Contrast the task of reporting employment and earnings for an individual continuously employed during the past calendar year with the response task of someone who has held three to four part-time jobs. Other questionnaire topics may request that the respondent report sensitive, threatening, socially undesirable, or perhaps illegal behavior. From both a cognitive and social psychological perspective, there is ample opportunity for the introduction of error in the reporting of the events and behaviors of primary interest in understanding the impacts of welfare reform.

This paper provides an introduction to the various sources of measurement error and examines two theoretical frameworks for understanding the various sources of error. The empirical literature concerning the quality of responses for reports of earnings, transfer income, employment and unemployment, and sensitive behaviors is examined, so as to identify those items most likely to be subjected to response error among the welfare population. The paper concludes with suggestions for attempting to reduce the various sources of error through alternative questionnaire and survey design.

Sources of Error in the Survey Process

The various disciplines that embrace the survey method, including statistics, psychology, sociology, and economics, share a common concern with the weakness of the measurement process, that is, the degree to which survey results deviate from "those that are the true reflections of the population" (Groves, 1989). The disciplines vary in the terminology used to describe error as well as in their emphasis on understanding the impact of measurement error on analyses or on reducing the various sources of error. The existence of these terminological differences, together with our desire to limit the focus of this research to measurement error, suggests that a brief commentary on the various conceptual frameworks may aid in defining our interests unambiguously.

One common conceptual framework is that of mean squared error, the sum of the variance and the square of the bias. Variance is the measure of the variable error associated with a particular implementation of a survey; inherent in the notion of variable error is the fundamental requirement of replication, whether over units of observation (sample units), questions, or interviewers. Bias, as used here, is defined as the type of error that affects all implementations of a survey design, a constant error, within a defined set of essential survey conditions (Hansen, Hurwitz, and Bershad, 1961). For example, the use of a single question to obtain total family income in the Current Population Survey (CPS) has been shown to underestimate annual income by approximately 20 percent (Bureau of the Census, 1979); this consistent underestimate would be considered the extent of the bias related to a particular question for a given survey design.
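The decomposition described above can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical survey design that is replicated many times and that underestimates a known true value by about 20 percent on average; the function name, the true value, and the noise level are all illustrative assumptions, not figures from the studies cited.

```python
import random

def mse_decomposition(estimates, true_value):
    """Return (mse, variance, bias) for replicated estimates of one quantity."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    # Bias: the constant error shared by all replications of the design.
    bias = mean_estimate - true_value
    # Variance: the variable error across replications (divisor n).
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n
    # Mean squared error: average squared deviation from the true value.
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mse, variance, bias

random.seed(0)
true_income = 50_000.0  # hypothetical true annual income
# Hypothetical replications: 20 percent underestimate plus random noise.
estimates = [0.8 * true_income + random.gauss(0, 2_000) for _ in range(1_000)]

mse, variance, bias = mse_decomposition(estimates, true_income)
print(f"bias={bias:.1f}  variance={variance:.1f}  mse={mse:.1f}")
print(f"variance + bias^2 = {variance + bias ** 2:.1f}")
```

With the divisor n used for both the variance and the mean squared error, the identity MSE = variance + bias squared holds exactly, apart from floating-point rounding.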

Another conceptual framework focuses on errors of observation as compared to errors of nonobservation (Kish, 1965). Errors of observation refer to the degree to which individual responses deviate from the true value for the measure of interest; as defined, they are the errors of interest for this research, to be referred to as measurement errors. Observational errors can arise from any of the elements directly engaged in the measurement process, including the questionnaire, the respondent, and the interviewer, as well as the characteristics which define the measurement process (e.g., the mode and method of data collection). Errors of nonobservation refer to errors related to the lack of measurement for some portion of the sample and can be classified as arising from three sources, coverage, nonresponse (both unit and item nonresponse), and sampling. Errors of nonobservation are the focus of other papers presented in this workshop (see, for example, Groves and Couper).

Questionnaire as Source of Measurement Error. Ideally, the question will convey the meaning of interest to the researcher. However, several linguistic, structural, and environmental factors affect the interpretation of the question by the respondent. These factors include the specific question wording, the structure of the question (open vs. closed), the order in which the questions are presented, the overall topic of the questionnaire, whether the question is read by the respondent (self-administration) or presented to the respondent by an interviewer, and the mode of communication used by the interviewer (that is, telephone vs. face-to-face presentation). Question wording is often seen as one of the major problems in survey research; while one can standardize the language read by the respondent or the interviewer, standardizing the language does not imply standardization of the meaning.

One source of variation in respondents’ comprehension of survey questions is differences in the perceived intent or meaning of the question. Perceived intent can be shaped by the sponsorship of the survey, the overall topic of the questionnaire, or by the environment more immediate to the question of interest, such as the context of the previous question or set of questions or the specific response options associated with the question.

Respondent as Source of Measurement Error. Once the respondent comprehends the question, he or she must retrieve the relevant information from memory, make a judgement as to whether the retrieved information matches the requested information, and communicate a response. Much of the measurement error literature has focused on the retrieval stage of the question answering process, classifying the lack of reporting of an event as retrieval failure on the part of the respondent and comparing the characteristics of events that are reported to those that are not reported. Several factors have been found to be related to the quality of reporting, including the length of the reference period and the salience of the information. For example, the literature suggests that the greater the length of the recall period, the greater the expected bias in the reporting of episodic information (e.g., Cannell, Fisher, and Bakker, 1965; Sudman and Bradburn, 1973). Salience is hypothesized to affect the strength of the memory trace and subsequently, the effort involved in retrieving the information from long-term memory. The weaker the trace, the greater the effort needed to locate and retrieve the information.

As part of the communication of the response, the respondent must determine whether he or she wishes to reveal the information as part of the survey process. Survey instruments often ask questions about socially and personally sensitive topics. It is widely believed, and well documented, that such questions elicit patterns of underreporting (for socially undesirable behavior and attitudes) as well as overreporting (for socially desirable behaviors and attitudes).

Interviewers as Sources of Measurement Error. For interviewer-administered questionnaires, interviewers may affect the measurement processes in one of several ways, including:

  • failure to read the question as written;
  • variation in interviewers’ ability to perform the other tasks associated with interviewing, for example, probing insufficient responses, selecting appropriate respondents, recording information provided by the respondent; and
  • demographic and socioeconomic characteristics as well as voice characteristics which influence the behavior and responses provided by the respondent.

The first two factors contribute to measurement error from a cognitive or psycholinguistic perspective in that different respondents are exposed to different stimuli; thus variation in responses is, in part, a function of the variation in stimuli. All three factors suggest that interviewer effects contribute via an increase in variable error across interviewers. If all interviewers erred in the same direction (or their characteristics resulted in errors of the same direction and magnitude), interviewer bias would result. For the most part, the literature indicates that among well trained interviewing staffs, interviewer error contributes to the overall variance of estimates as opposed to resulting in biased estimates (Lyberg and Kasprzyk, 1991).

Other Essential Survey Conditions as Sources of Measurement Error. Any data collection effort involves decisions concerning the features which define the overall design of the survey, here referred to as the essential survey conditions. In addition to the sample design and the wording of individual questions and response options, these decisions include:

  • whether to use interviewers or to collect information via some form of self-administered questionnaire;
  • the means for selecting and training interviewers (if applicable);
  • the mode of data collection for interviewer administration (telephone vs. face to face);
  • the method of data collection (paper and pencil, computer assisted);
  • whether to contact respondents for a single interview (cross-sectional design) or follow respondents over time (longitudinal or panel design);
  • for longitudinal designs, the frequency and periodicity of measurement;
  • the identification of the organization for whom the data are collected; and
  • the identification of the data collection organization.

No one design or set of design features is clearly superior with respect to overall data quality. For example, as noted above, interviewer variance is one source of variability that can obviously be eliminated through the use of a self-administered questionnaire. However, the use of an interviewer may aid in the measurement process by providing the respondent with clarifying information or by probing insufficient responses.

Measurement Error Associated with Autobiographical Information: Theoretical Framework

Three distinct literatures provide the basis for the theoretical framework underlying investigations of measurement error in surveys. These theoretical foundations come from the fields of cognitive psychology, social psychology, and to a lesser extent, social linguistics.1 Although research concerning the existence, direction, and magnitude of response error, as well as its correlates, has provided insight into the factors associated with measurement error, there are few fundamental principles that inform either designers of data collection efforts or analysts of survey data as to the circumstances, either individual or design-based, under which measurement error is most likely to be significant. Those tenets that appear to be robust across substantive areas are outlined below.

Cognitive Theory

Tourangeau (1984) as well as others (see Sudman, Bradburn, and Schwarz, 1996 for a review) have categorized the survey question and answer process as a four-step process involving comprehension of the question, retrieval of information from memory, assessment of the correspondence between the retrieved information and the requested information, and communication. In addition, the encoding of information, a process outside the control of the survey interview, determines a priori whether the information of interest is available for the respondent to retrieve from long-term memory.

Comprehension of the interview question is the "point of entry" to the response process. Does the question convey the concept(s) of interest? Is there a shared meaning among the researcher, the interviewer, and the respondent with respect to each of the words as well as the question as a whole? The comprehension of the question involves not only knowledge of the particular words and phrases used in the questionnaire, but also the respondent’s impression of the purpose of the interview, the context of the particular question, and the interviewer’s behavior in the delivery of the question.

The use of simple, easily understood language is not sufficient for guaranteeing shared meaning among all respondents. Belson (1981) found that even simple terms were subject to misunderstanding. For example, Belson examined respondents’ interpretation of the following question 2: "For how many hours do you usually watch television on a weekday? This includes evening viewing." He found that respondents varied in their interpretation of various terms such as "how many hours" (sometimes interpreted as requesting starting and stopping times of viewing), "you" (interpreted to include other family members), "usually" and "watch television" (interpreted to mean being in the room in which the television is on).

Much of the measurement error literature has focused on the retrieval stage of the question answering process, classifying the lack of reporting of an event as retrieval failure on the part of the respondent, comparing the characteristics of events which are reported to those which are not reported. One of the general tenets from this literature concerns the length of the recall period; the greater the length of the recall period, the greater the expected bias due to respondent retrieval and reporting error. This relationship has been supported by empirical data investigating the reporting of consumer expenditures and earnings (Neter and Waksberg, 1964); the reporting of hospitalizations, visits to physicians, and health conditions (e.g. Cannell, et al., 1965); reports of motor vehicle accidents (Cash and Moss, 1969), crime (Murphy and Cowan, 1976); and recreation (Gems, Ghosh, and Hitlin, 1982). However, even within these studies the findings with respect to the impact of the length of recall period on the quality of survey estimates are not consistent. For example, Dodge (1970) found that length of recall was significant in the reporting of robberies but had no effect on the reporting of various other crimes, such as assaults, burglaries, and larcenies. Contrary to theoretically-justified expectations, the literature also offers several examples in which the length of the recall period had no effect on the magnitude of response errors (see for example, Mathiowetz and Duncan, 1989; Schaeffer, 1994). These more recent investigations point to the importance of the complexity of the behavioral experience over time, as opposed to simply the passage of time, as the factor most indicative of measurement error, a finding that harkens back to theoretical discussions of the impact of interference on memory (Crowder, 1976).

Response errors associated with the length of the recall period are typically classified as either telescoping error, that is, the tendency of the respondent to report events as occurring earlier (backward telescoping) or more recently (forward telescoping) than they actually occurred, or recall decay, the inability of the respondent to recall relevant events occurring in the past (errors of omission). Forward telescoping is believed to dominate recall errors when the reference period for the questions is of short duration, while recall decay is more likely to have a major effect when the reference period is of long duration. In addition to the length of the recall period, the relative salience of the event affects the likelihood of either telescoping or memory decay.

Another tenet arising from the collaborative efforts of cognitive psychologists and survey methodologists concerns the relationship between true behavioral experience and the retrieval strategies undertaken by a respondent. Recent investigations suggest that the retrieval strategy undertaken by the respondent to provide a "count" of a behavior is a function of the true behavioral frequency. Research by Burton and Blair (1991) indicates that respondents choose to count events or items (episodic enumeration) if the frequency of the event/item is low and that they rely on estimation for more frequently occurring events. The point at which respondents switch from episodic counting to estimation varies by the characteristics of both the respondent and the event. As Sudman, et al. (1996) note, "no studies have attempted to relate individual characteristics such as intelligence, education, or preference for cognitive complexity to the choice of counting or estimation, controlling for the number of events" (p. 201). Work by Menon (1993, 1994) suggests that it is not simply the true behavioral frequency that determines retrieval strategies, but also the degree of regularity and similarity among events. According to her hypotheses, those events that are both regular and similar (brushing teeth) require the least amount of cognitive effort to report, with respondents relying on retrieval of a rate to produce a response. Those events occurring irregularly require more cognitive effort on the part of the respondent.

The impact of different retrieval strategies with respect to the magnitude and direction of measurement error is not well understood; the limited evidence suggests that errors of estimation are often unbiased, although the variance about an estimate (e.g., mean value for the population) may be large. Episodic enumeration, however, appears to lead to biased estimates of the event or item of interest, with a tendency to be biased upward for short recall periods and downward for long recall periods.

A third tenet springing from this same literature concerns the salience or importance of the behavior to be retrieved. Salience is hypothesized to affect the strength of the memory trace and subsequently, the effort involved in retrieving the information from long-term memory. The stronger the trace, the lower the effort needed to locate and retrieve the information. Cannell, et al. (1965) report that those events judged to be important to the individual were reported more completely and accurately than other events. Mathiowetz (1986) found that short spells of unemployment were less likely to be reported than longer (i.e. more salient) spells.

The last maxim concerns the impact of interference related to the occurrence of similar events over the respondent’s life or during the reference period of interest. Classical interference and information-processing theories suggest that as the number of similar or related events occurring to an individual increases, the probability of recalling any one of those events declines. An individual may lose the ability to distinguish between related events, resulting in an increase in the rate of errors of omission. Inaccuracy concerning the details of any one event may also increase as the respondent makes use of general knowledge or impressions concerning a class of events for reconstructing the specifics of a particular occurrence. Interference theory suggests that "forgetting" is a function of both the number and temporal pattern of related events in long-term memory.

Social Psychology: The Issue of Social Desirability

In addition to asking respondents to perform the difficult task of retrieving complex information from long-term memory, survey instruments often ask questions about socially and personally sensitive topics. Some topics are deemed, by social consensus, to be too sensitive to discuss in "polite" society. In the 1990s this is a much shorter list than was true in the 1950s, but most would agree that topics such as sexual practices, impotence, and bodily functions fall within this classification. Some (e.g., Tourangeau, Rips, and Rasinski, forthcoming) hypothesize that questions concerning income also fall within this category. Other questions may concern topics which have strong positive or negative normative responses (e.g., voting; the use of pugnacious terms with respect to racial or ethnic groups) or for which there may be criminal retribution (e.g., use of illicit drugs; child abuse).

The sensitivity of the behavior or attitude of interest may affect both the encoding of the information as well as the retrieval and reporting of the material; little of the survey methodological research has addressed the point at which the distortion or measurement error occurs with respect to the reporting of sensitive material. Even if the respondent is able to retrieve accurate information concerning the behavior of interest, he or she may choose to edit this information at the response formation stage as a means to reduce the costs, ranging from embarrassment to potential negative consequences beyond the interview situation, associated with revealing the information.

Applicability of Findings to the Measurement of Economic Phenomena

One of the problems in drawing inferences from other substantive fields to that of economic phenomena is the difference in the nature of the measures of interest. Much of the assessment of the quality of household-based survey reports concerns the reporting of discrete behaviors; many of the economic measures that are the subject of inquiry with respect to the measurement of the welfare population are not necessarily discrete behaviors or even phenomena that can be linked to a discrete memory. Some of the phenomena of interest could be considered trait phenomena. Let us consider the reporting of occupation. We speculate that the cognitive process by which one formulates a response to a query concerning current occupation is different from the process related to reporting number of doctor visits during the past year.

For other economic phenomena, we speculate that individual differences in the approach to formulating a response impact the magnitude and direction of error associated with the measurement process. Consider the reporting of current earnings related to employment. For some respondents, the request to report current earnings requires little cognitive effort--it may almost be an automatic response. For these individuals, wages may be considered a characteristic of their self identity, a trait related to how they define themselves. For other individuals, the request for information concerning current wages may require the retrieval of information from a discrete episode (the last paycheck), a recent rehearsal of the information (the reporting of wages in an application for a credit card), or the construction of an estimate at the time of the query based on the retrieval of information relevant to the request.

Given both the theoretical and empirical research conducted within multiple branches of psychology and survey methodology, what would we anticipate are the patterns of measurement error for various economic measures? The response to that question is a function of how the respondent’s task is formulated and the very nature of the phenomena of interest. For example, asking a respondent to provide an estimate of the number of weeks of unemployment during the past year is quite different from the task of asking the respondent to report the starting and stopping dates of each unemployment spell for the past year. For individuals who are in a steady-state (constant employment or unemployment), neither task could be considered a difficult cognitive process. For these individuals, unemployment is not a discrete event but rather may become encoded in memory as a trait which defines the respondent. However, for the individual with sporadic spells of unemployment throughout the year, the response formulation process would most likely differ for the two questions. While the response formulation process for the former task permits an estimation strategy on the part of the respondent, the latter requires the retrieval of discrete periods of unemployment. For the reporting of these discrete events, we would hypothesize that patterns of response error evident in the reporting of counts across other substantive fields would be observed. Similar patterns of differences may be observed as a function of requesting the respondent to report current earnings as compared to directing them to think about their last paycheck and report the gross earnings. With respect to social desirability, we would anticipate patterns similar to those evident in other types of behavior: overreporting of socially desirable behaviors and underreporting of socially undesirable behaviors.

Measurement Error in Household Reports of Income

As noted by Moore, Stinson and Welniak (1999), the reporting of income by household respondents in many surveys can be characterized as a two step process, the first involving the correct enumeration of sources of household income and the second, the accurate reporting of the amount of the income for the specific source. They find that response error in the reporting of various sources and amounts of income may be due to a large extent to cognitive factors, such as "definitional issues, recall and salience problems, confusion, and sensitivity" (p. 155). We return to these cognitive factors when considering alternative means for reducing measurement error in surveys of the low income population.

Earnings

Empirical evaluations of household-reported earnings information include the assessment of annual earnings, usual earnings (with respect to a specific pay period), most recent earnings, and hourly wage rates. These studies rely on various sources of validation data, including the use of employers’ records, administrative records, and respondents’ reports for the same reference period reported at two different times.

With respect to reports of annual earnings, mean estimates appear to be subject to relatively small levels of response error, although absolute differences indicate significant over- and underreporting at the individual level. For example, Borus (1970) found that the mean error in reports of annual earnings was small and insignificant; however, over 10 percent of the respondents misreported annual earnings by $1,000 (based on a mean of $2,500).

Similarly, Duncan and Hill (1985) compared reports of annual earnings for calendar years 1981 and 1982 with information obtained from the employer’s records. For neither year is the mean of the simple difference between the two data sources statistically significant (8.5 percent and 7 percent of the mean, respectively), although the absolute differences for each year indicate significant under- and overreporting. The error-to-true variance ratio for calendar year 1982 annual earnings was quite small (.154) but significantly larger for 1981 (.301). The simple correlation between errors in the reports of earnings across the two years is .43. Comparison of measures of change in annual earnings based on the household report and the employer records indicates no difference; interview reports of absolute change averaged $2,992 (or 13%) compared to the employer-based estimate of $3,399 (or 17%). Under the assumption of equal error variance across the two years, the ratio of error-to-true variance for change in annual earnings between the two years is .320; without this assumption, the simple difference between 1981 and 1982 produces an error-to-true variance ratio of .501.
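To make the error-to-true variance ratio concrete, the following is a minimal sketch that assumes the ratio is computed as the variance of the reporting error (interview report minus employer record) divided by the variance of the record value, both for a single year and for year-to-year change. The function names and the five-person data set are hypothetical and do not attempt to reproduce the published figures.

```python
import statistics

def error_to_true_variance_ratio(reported, record):
    """Variance of (reported - record) divided by variance of record."""
    errors = [r - t for r, t in zip(reported, record)]
    return statistics.pvariance(errors) / statistics.pvariance(record)

def change_error_to_true_variance_ratio(rep_y1, rec_y1, rep_y2, rec_y2):
    """Same ratio applied to year-to-year change instead of levels."""
    reported_change = [b - a for a, b in zip(rep_y1, rep_y2)]
    record_change = [b - a for a, b in zip(rec_y1, rec_y2)]
    return error_to_true_variance_ratio(reported_change, record_change)

# Hypothetical employer records and interview reports for five workers.
record_y1 = [20_000, 25_000, 30_000, 35_000, 40_000]
report_y1 = [21_000, 24_000, 31_000, 34_000, 41_000]
record_y2 = [22_000, 26_000, 33_000, 36_000, 45_000]
# In this toy example each worker repeats exactly the same error in year 2.
report_y2 = [t + (r - s) for t, r, s in zip(record_y2, report_y1, record_y1)]

level_ratio = error_to_true_variance_ratio(report_y1, record_y1)
change_ratio = change_error_to_true_variance_ratio(
    report_y1, record_y1, report_y2, record_y2
)
print(level_ratio, change_ratio)
```

Because the toy errors are perfectly correlated across the two years, they cancel in the change measure and the change ratio is zero. With weaker positive correlation, such as the .43 reported above, the cancellation would be only partial.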

Although the findings noted above are based on small samples drawn from either a single geographic area (Borus) or a single firm (Duncan and Hill), the results parallel the findings from empirical research based on nationally representative samples. Bound and Krueger (1991) examined error in annual earnings as reported in the March 1978 Current Population Survey. Although the error was distributed about approximately a zero mean for both men and women, the magnitude of the error was substantial.

In addition to examining bias in mean estimates, the studies by Duncan and Hill and Bound and Krueger examined the relationship between measurement error and true earnings. Both studies indicate a significant negative relationship between error in reports of annual earnings and the true value of annual earnings. Similar to Duncan and Hill (1985), Bound and Krueger (1991) report positive autocorrelation (.4 for men and .1 for women) between errors in CPS-reported earnings for the two years of interest, 1976 and 1977.

Both Duncan and Hill (1985) and Bound and Krueger (1991) explore the implications of measurement error for earnings models. Duncan’s and Hill’s model relates the natural logarithm of annual earnings to three measures of human capital investment: education, work experience prior to current employer, and tenure with current employer, using both the error-ridden self-reported measure of annual earnings and the record-based measure as the left-hand-side variable. A comparison of the ordinary least squares parameter estimates based on the two dependent variables suggests that measurement error in the dependent variable has a sizeable impact on the parameter estimates. For example, estimates of the effects of tenure on earnings based on interview data were 25% lower than the effects based on record earnings data. Although the correlation between error in reports of earnings and error in reports of tenure was small (.05) and insignificant, the correlation between error in reports of earnings and actual tenure was quite strong (-.23) and highly significant, leading to attenuation in the estimated effects of tenure on earnings based on interview information.
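The attenuation mechanism described above can be sketched with a small simulation. This is an illustration under assumed values, not a reconstruction of Duncan and Hill's model: the true tenure coefficient, the slope of the reporting error on tenure, and the noise levels are all hypothetical, and only a single regressor is used.

```python
import random

def ols_slope(x, y):
    """Slope from a simple ordinary least squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x

random.seed(42)
n = 5_000
TRUE_SLOPE = 0.03      # hypothetical effect of a year of tenure on log earnings
ERROR_SLOPE = -0.0075  # hypothetical: reporting error falls as tenure rises

tenure = [random.uniform(0, 20) for _ in range(n)]
# Record-based log earnings: the error-free dependent variable.
record_log_earnings = [9.5 + TRUE_SLOPE * t + random.gauss(0, 0.3) for t in tenure]
# Reporting error that is negatively correlated with actual tenure.
report_error = [ERROR_SLOPE * (t - 10) + random.gauss(0, 0.1) for t in tenure]
reported_log_earnings = [y + e for y, e in zip(record_log_earnings, report_error)]

slope_record = ols_slope(tenure, record_log_earnings)
slope_reported = ols_slope(tenure, reported_log_earnings)
print(f"record-based slope:   {slope_record:.4f}")
print(f"interview-based slope: {slope_reported:.4f}")
```

Because the simulated reporting error declines with tenure, the interview-based slope falls below the record-based slope by roughly the assumed error slope. Error in the dependent variable that is uncorrelated with the regressors would add noise without shifting the coefficient in this way.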

Bound and Krueger (1991) also explore the ramifications of an error-ridden left-hand-side variable by regressing error in reports of earnings on a number of human capital and demographic factors, including education, age, race, marital status, region, and SMSA. Similar to Duncan and Hill, the model attempts to quantify the extent to which the correlation between measurement error in the dependent variable and right-hand-side variables biases the estimates of the parameters. However, in contrast to Duncan and Hill, Bound and Krueger conclude that mismeasurement of earnings leads to little bias when CPS-reported earnings are on the left-hand-side of the equation.

The reporting of annual earnings within the context of a survey is most likely aided by the number of times the respondent has rehearsed the retrieval and reporting process for this information. For some members of the population, we contend that the memory for one’s annual earnings is reinforced throughout the calendar year, for example, in the preparation of federal and state taxes or the completion of applications for credit cards and loans. To the extent that these requests have motivated the respondent to determine and report an accurate figure, such information should be encoded in the respondent’s memory. Subsequent survey requests should therefore be "routine" in contrast to many of the types of questions posed to a survey respondent. Hence we would hypothesize that response error in such situations would result from retrieval of the wrong information (e.g., annual earnings for calendar year 1996 rather than 1997; net rather than gross earnings), social desirability issues (e.g., overreports related to presentation of self to the interviewer), or privacy concerns, which may lead to either misreporting or item nonresponse. Although the limited literature on the reporting of earnings among the low income population indicates a high correlation between record and reported earnings (Halsey, 1978), we hypothesize that for some members of the population, for example, low income individuals for whom there is little or no opportunity for rehearsal of annual earnings information, a survey request would not be routine and may require very different response strategies than for those with rehearsal opportunities. 
Hence, although asking annual earnings as a single question may yield reasonably accurate responses among the general population, we speculate that, for the low income population, those with loose ties to the labor force, or those for whom the retrieval of earnings information requires separate estimates for multiple jobs, the request for earnings information may require a decomposition approach (that is, asking about earnings for individual jobs) or some type of estimation approach. We return to a discussion concerning possible repairs in the last section of the paper.

In contrast to the task of reporting annual earnings, the survey request to report weekly earnings, most recent earnings, or usual earnings, is most likely a relatively unique request and one which may involve the attempted retrieval of information that may not have been encoded by the respondent, the retrieval of information that has not been accessed by the respondent before, or the calculation of an estimate "on the spot." Hence, we would anticipate that requests for earnings in any metric apart from a well-rehearsed metric would lead to significant differences between household reports and validation data.

Borus (1966) reports a high correlation (.95) between household reports and employers’ records of weekly earnings, small mean absolute deviations between the two sources, and equal amounts of over- and under-reporting. In contrast, Carstensen and Woltman (1979), in a study which examined the quality of reports for various rates of pay (per hour, week, or month), report significant differences between household reports and employers’ records in the direction of underreporting for hourly and weekly reports and overreporting for monthly reports. Rogers, Brown, and Duncan (1993) report correlations of .49 and .46 between household reports and company records for the most recent and usual pay, respectively, in contrast to a correlation of .79 for reports of annual earnings. In addition, they calculated an hourly wage rate from the respondents’ reports of annual, most recent, and usual earnings and hours and compared that hourly rate to the rate as reported by the employer; error in the reported hours for each respective time period therefore contributes to noise in the hourly wage rate. Similar to the findings for earnings, correlations between the employer’s records and self reports were highest when based on annual earnings and hours (.61) and significantly lower when based on most recent earnings and hours and usual earnings and hours (.38 and .24, respectively).

Hourly wages calculated from the CPS reported earnings and hours compared to employers’ records indicate a small, but significant, rate of underreporting, which may be due to an overreporting of hours worked, an underreporting of annual earnings, or a combination of the two (Mellow and Sider, 1983). Similar to Duncan and Hill (1985), Mellow and Sider examined the impact of measurement error in wage equations; they concluded that the structure of the wage determination process model was unaffected by the use of respondent- or employer-based information, although the overall fit of the model was somewhat higher with employer-reported wage information.

As noted earlier, one of the shortfalls with the empirical investigations concerning the reporting of earnings is the lack of studies targeted at those for whom the reporting task is most difficult--those with multiple jobs or sporadic employment. Although the empirical findings suggest that annual earnings are reported more accurately than earnings for other periods of time, the opposite may be true among those for whom annual earnings are highly variable and the result of complex employment patterns.

One of the major concerns with respect to earnings questions in surveys of TANF leavers is the reference period of interest. Many of the surveys request that respondents report earnings for reference periods that may be of little salience to the respondent or for which the determination of the earnings is quite complex. For example, questions often focus on the month in which the respondent left welfare (which may have been several months prior to the interview) or the six-month period prior to exiting welfare. Consider the following series of questions:

  1. During the six months you were on welfare before you got off in MONTH, did you ever have a job which paid you money? [Awkward reference period; could be improved through reference to specific months and encouraging respondent to use calendar. May further be complicated if period falls over two calendar years]
  2. At the time you left welfare in MONTH, did you have a job or jobs which paid you money?

We would anticipate response error for these types of reference periods to be significantly greater than the levels reported in the empirical literature for annual, weekly, or hourly earnings.

Transfer Program Income and Child Support

For most surveys, the reporting of transfer program income is a two-stage process in which respondents first report recipiency (or not) of a particular form of income and then, among those who report recipiency, the amount of the income. One of the shortcomings of many studies which assess response error associated with transfer program income is the design of the study, in which the sample for the study is drawn from those known to be participants in the program. Responses elicited from respondents are then verified with administrative data. Retrospective or reverse record check studies limit the assessment of response error, with respect to recipiency, to determining the rate of underreporting; prospective or forward record check studies which only verify positive recipiency responses are similarly flawed since by design they limit the assessment of response error only to overreports. In contrast, a “full” design permits the verification of both positive and negative recipiency responses and includes in the sample a full array of respondents. Validation studies which sample from the general population and link all respondents, regardless of response, to the administrative record of interest, represent full study designs.
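
The asymmetry among these designs can be illustrated with a small sketch. The counts below are hypothetical and are not drawn from any study cited here; the names `full_design` and `error_counts` are likewise illustrative. The sketch simply tallies which errors each design is able to observe.

```python
# Hypothetical linked records: (administrative_truth, survey_report) per person.
# True means "received the benefit" / "reported receipt". All counts are invented.
full_design = (
    [(True, True)] * 60        # recipients who report receipt
    + [(True, False)] * 40     # recipients who fail to report (underreports)
    + [(False, True)] * 9      # non-recipients who report receipt (overreports)
    + [(False, False)] * 891   # non-recipients who report no receipt
)

def error_counts(pairs):
    """Return (underreports, overreports) observable in a set of linked pairs."""
    under = sum(1 for truth, report in pairs if truth and not report)
    over = sum(1 for truth, report in pairs if report and not truth)
    return under, over

# Reverse record check: the sample is drawn only from known recipients.
reverse_check = [pair for pair in full_design if pair[0]]
# Forward record check: only positive survey reports are verified.
forward_check = [pair for pair in full_design if pair[1]]

print("full design:  ", error_counts(full_design))
print("reverse check:", error_counts(reverse_check))
print("forward check:", error_counts(forward_check))
```

Under these invented counts, only the full design observes both kinds of error; the reverse check can observe only underreports and the forward check only overreports.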

Focusing our attention first on reporting of receipt of a particular transfer program, among full design studies, there does appear to be a tendency for respondents to underreport receipt, although there are also examples of overreporting recipiency status. For example, Oberheu and Ono (1975) report a low correspondence between administrative records and household report for receipt of AFDC (monthly and annual) and Food Stamps (between .3 and .5), but relatively low net rates of under- and over-reporting. Underreporting of the receipt of general assistance as reported in two studies is less than 10 percent (e.g., David, 1962). In a study reported by Marquis and Moore (1990), respondents were asked to report recipiency status for eight months (in two successive waves of SIPP interviews). Although Marquis and Moore report a low overall error rate of approximately 1% to 2%, the error rate among true recipients is significant, in the direction of underreporting. For example, among those receiving AFDC, respondents failed to report receipt in 49% of the person-months. Underreporting rates were lowest among OASDI beneficiaries, for which approximately 5% of the person-months of recipiency were not reported by the household respondents. The mean rates of participation based on the two sources suggest little difference, with absolute differences between the two sources of less than one percentage point for all income types. However, the rareness of some of these programs means that small absolute biases mask high rates of relative bias among true participants, ranging from +1% for OASDI recipiency to almost 40% for AFDC recipiency. In a follow-up study, Moore, Marquis, and Bogen (1996) compared underreporting rates of known recipients to overreporting rates for known non-recipients and found underreporting rates to be much higher than the rate of false positives by non-recipients.
They also note that underreporting on the part of known recipients tends to be due to failure to ever report receipt of a particular type of income rather than failure to report specific months of receipt.
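
The way a small absolute bias can coexist with a large relative bias for a rare program can be shown with simple arithmetic. The participation rates and underreporting shares below are hypothetical values chosen for illustration, not figures from the studies cited; `participation_bias` is an illustrative helper.

```python
def participation_bias(true_rate, underreport_share, false_positive_rate=0.0):
    """Return (absolute bias in percentage points, relative bias in percent).

    true_rate:           true share of the population receiving the benefit
    underreport_share:   share of true recipients who fail to report receipt
    false_positive_rate: share of non-recipients who report receipt
    """
    reported_rate = (true_rate * (1 - underreport_share)
                     + (1 - true_rate) * false_positive_rate)
    absolute_pp = (reported_rate - true_rate) * 100
    relative_pct = (reported_rate - true_rate) / true_rate * 100
    return absolute_pp, relative_pct

# A rare program with heavy underreporting (hypothetical values).
rare = participation_bias(true_rate=0.02, underreport_share=0.40)
# A common program with light underreporting (hypothetical values).
common = participation_bias(true_rate=0.20, underreport_share=0.01)

print("rare program:   %.1f points, %.0f%% relative" % rare)
print("common program: %.1f points, %.0f%% relative" % common)
```

In this sketch both programs show an absolute bias of less than one percentage point, yet the relative bias for the rare program is forty times that of the common one.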

In contrast, Yen and Nelson (1996) found a slight tendency among AFDC recipients to overreport receipt in any given month, such that estimates based on survey reports exceeded estimates based on records by approximately 1 percentage point. Oberheu and Ono (1975) also note a net overreporting for AFDC (annual) and Food Stamp recipiency (annual), 8% and 6%, respectively.

The studies vary in their conclusions with respect to the direction and magnitude of response error concerning the amount of the transfer. Several studies report a significant underreporting of assistance amount (e.g., David, 1962; Livingston, 1969; Oberheu and Ono, 1975; Halsey, 1978) or significant differences between the survey and record reports (Grondin and Michaud, 1994). Other studies report little to no difference in the amount based on the survey and record reports. Hoaglin (1978) finds no difference in median response error for welfare amounts and only small negative differences in the median estimates for monthly Social Security Income. Goodreau, Oberheu, and Vaughan (1984) found that 65% of the respondents accurately reported the amount of AFDC support; the survey report accounted for 96% of the actual amount of support. Although Halsey (1978) reported a net bias in the reporting of unemployment insurance amount of -50%, Dibbs, Hale, Loverock, and Michaud (1995) conclude that the average household report of unemployment benefits differed from the average true value by approximately 5% ($300 on a base of $5600).

Schaeffer (1994) compared custodial parents’ reports of support owed and support paid to court records. The distribution of response errors indicated significant under- and overreporting of both the amount owed and the amount paid. The study also examined the factors contributing to the absolute level of errors in the reports of amount owed and paid; the findings indicate that the complexity of the respondent’s support experience had a substantial impact on the accuracy of the reports. As noted by Schaeffer (1994):

The errors made by those who were owed or paid support for only some months in 1986 were on the order of 74 and 15 times larger than error of those with no support. In contrast, those who were paid support for all 12 months during 1986 did not make significantly larger reporting errors than did those who were paid none; the error for those who were owed support all 12 months was approximately 5 times larger than it was for those who were owed no support. Errors also increase substantially when the amount of support owed or paid was variable (pp. 154-156)

The findings from the study indicate that characteristics of the events (payments) were more important in predicting response error than characteristics of the respondent or factors related to memory decay. The analysis suggests two areas of research directed toward improving the reporting of child support payments: improving the comprehension of the question (specifically, clarifying and distinguishing child support from other transfer payments), and identifying respondents for whom the reporting process is difficult (e.g., through use of a filter question) so that they can be asked follow-up questions specific to their behavioral experience.

Hours Worked

Empirical investigations concerning the quality of household reports of hours worked are few in number but consistent in their findings. Regardless of whether the measure of interest is hours worked last week, annual work hours, usual hours worked, or hours associated with the previous or usual pay period, comparisons between company records and respondents’ reports indicate an overestimate of the number of hours worked.

Carstensen and Woltman (1979) compared reports of "usual" hours worked per week. They found that, compared to company records, mean usual hours worked were significantly overreported by household respondents (37.1 hours per company records vs. 38.4 hours per household reports), a difference on average of 1.33 hours, or 3.6% of the usual hours worked. Similarly, Mellow and Sider (1983) report that the mean difference between the natural log of worker reported hours and the natural log of employer reported hours is positive (.039). Self reports exceeded employer records by almost 4% on average; however, for approximately 15% of the sample, the employer records exceeded the estimate provided by the respondent. A regression explaining the difference between the two sources indicates that professional and managerial workers were more likely to overestimate their hours, as were respondents with higher levels of education and nonwhite respondents. In contrast, female respondents tended to underreport usual hours worked.
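
The step from a mean log difference of .039 to "almost 4%" can be checked directly. The short sketch below converts a difference in natural logs to a percentage difference; it assumes only the .039 figure quoted above, and `log_diff_to_percent` is an illustrative name. Strictly, the result describes the ratio of geometric means rather than arithmetic means.

```python
import math

def log_diff_to_percent(mean_log_difference):
    """Convert a mean difference in natural logs to a percent difference."""
    return (math.exp(mean_log_difference) - 1) * 100

# Mean of ln(worker-reported hours) minus ln(employer-reported hours).
overreport_pct = log_diff_to_percent(0.039)
print(round(overreport_pct, 2))
```

For differences this small the log difference and the percentage difference are nearly identical, which is why the two figures are used interchangeably.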

Similar to their findings concerning the reporting of earnings, Rogers, Brown, and Duncan (1993) report that the correlation between self reports and company records is higher for annual number of hours worked (.71) than for either reports of hours associated with the previous pay period (.61) or usual pay period (.61). Barron, Berger, and Black (1997) report a high correlation between employers’ records and respondents’ reports of hours last week, .769. Measurement error in hours worked is not independent of the true value; as reported by Rogers, Brown and Duncan (1993), the correlation between error in reports of hours worked and true scores ranged from -.307 for annual hours worked in the calendar year immediately prior to the date of the interview to -.357 for hours associated with the previous pay period and -.368 for hours associated with "usual" pay period.

Examination of a standard econometric model with earnings as the left-hand-side variable and hours worked as one of the predictor variables indicates that the high correlation between the errors in reports of earnings and hours (ranging from .36 for annual measures to .54 for last pay period) seriously biases parameter estimates. For example, regressions of reported and company record annual earnings (log) on record or reported hours, age, education, and tenure with the company provide a useful illustration of the consequences of measurement error. Based on respondent reports of earnings and hours, the coefficient for hours (log hours) is only about 40% of the coefficient based on company records (.41 vs. 1.016), while the coefficient for age is 50% larger in the model based on respondent reports. In addition, the fit of the model based on respondent reports is less than half that of the fit based on company records (R² of .352 vs. .780).
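
A simulation can convey why error in a reported predictor pulls its coefficient toward zero. The sketch below is a deliberately simplified case and not a reconstruction of the models discussed above: it assumes classical measurement error (independent of the true value and of the error in earnings), whereas the errors reported in this literature are correlated with each other and with the true values. All data are simulated and all names are illustrative.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the sketch is reproducible
n = 20000
true_slope = 1.0         # assumed true effect of log hours on log earnings

# Simulated "record" log hours (variance 1) and log earnings.
record_hours = [rng.gauss(0.0, 1.0) for _ in range(n)]
log_earnings = [true_slope * h + rng.gauss(0.0, 0.5) for h in record_hours]

# Simulated self-reports: record value plus independent noise (variance 1).
reported_hours = [h + rng.gauss(0.0, 1.0) for h in record_hours]

slope_record = ols_slope(record_hours, log_earnings)
slope_reported = ols_slope(reported_hours, log_earnings)

print("slope using record hours:  ", round(slope_record, 2))
print("slope using reported hours:", round(slope_reported, 2))
```

With equal variances for the true values and the reporting noise, the expected slope based on reported hours is half the true slope, since under classical error the coefficient shrinks by the ratio of true variance to total variance.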

Duncan and Hill (1985) compare the quality of reports of annual hours worked for two different reference periods, the prior calendar year and the calendar year ending 18 months prior to the interview. The quality of the household reports declines as a function of the length of the recall period, although the authors report significant overreporting for each of the two calendar years of interest. The average absolute error in reports of hours worked (157 hours) was nearly 10% of the mean annual hours worked for 1982 (µ=1,603) and nearly 12% (211 hours) of the mean for 1981 (µ=1,771). Comparisons of change in hours worked reveal that although the simple differences calculated from two sources have similar averages, the absolute amount of change reported in the interview significantly exceeds that based on the record report.
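
The percentages quoted above follow directly from the reported figures, as the short check below shows. It uses only the numbers given in the text; `error_share` is an illustrative helper.

```python
def error_share(mean_absolute_error, mean_hours):
    """Average absolute error as a percentage of mean annual hours worked."""
    return 100 * mean_absolute_error / mean_hours

share_1982 = error_share(157, 1603)   # prior calendar year
share_1981 = error_share(211, 1771)   # the more distant calendar year

print(round(share_1982, 1))  # 9.8
print(round(share_1981, 1))  # 11.9
```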

In contrast to the findings with respect to annual earnings, we see both a bias in the population estimates as well as bias in the individual reports of hours worked in the direction of overreporting. This finding persists across different approaches to measuring hours worked, regardless of whether the respondent is asked to report on hours worked last week (CPS) or account for the weeks worked last year, which are then converted to total hours worked during the year (PSID). Whether this is a function of social desirability or related to the cognitive processes associated with formulating a response to the questions used to measure hours worked is something that can only be speculated on at this point.

Unemployment

In contrast to the small number of studies which assess the quality of household reports of hours worked, there are a number of studies which have examined the quality of unemployment reports. These studies encompass a variety of unemployment measures including annual number of person years of unemployment, weekly unemployment rate, occurrence and duration of specific unemployment spells, and total annual unemployment hours. Only one study reported in the literature, the PSID validation study (Duncan and Hill, 1985; Mathiowetz, 1986; Mathiowetz and Duncan, 1988), compares respondents’ reports with validation data; the majority of the studies rely on comparisons of estimates based on alternative study designs or examine the consistency in reports of unemployment duration across rounds of data collection. In general, the findings suggest that retrospective reports of unemployment by household respondents underestimate unemployment, regardless of the unemployment measure of interest.

The studies by Morgenstern and Bartlett (1974), Horvath (1982), and Levine (1993) compare the contemporaneous rate of unemployment as produced by the monthly CPS to the rate resulting from retrospective reporting of unemployment during the previous calendar year.³ The measures of interest vary from study to study; Morgenstern and Bartlett focus on annual number of person years of unemployment as compared to average estimates of weekly unemployment (Horvath) or an unemployment rate, as discussed by Levine. Regardless of the measure of interest, the empirical findings from the three studies indicate that when compared to the contemporaneous measure, retrospective reports of labor force status result in an underestimate of the unemployment rate.

Across the three studies, the underreporting rate is significant and appears to be related to demographic characteristics of the individual. For example, Morgenstern and Bartlett (1974) report discrepancy rates ranging from around 3 percent to 24 percent, with the highest discrepancy rates among women (22 percent for black women; 24 percent for white women). Levine (1993) compared the contemporaneous and retrospective reports by age, race, and gender. He found the contemporaneous rates to be substantially higher relative to the retrospective reports for teenagers, regardless of race or sex, and for women. Across all of the years of the study, 1970-1988, the retrospective reports for white males, ages 20 to 59, were almost identical to the contemporaneous reports.

Duncan and Hill (1985) found that the overall estimates of mean number of hours unemployed in years t and t-1 based on employee reports and company records did not differ significantly. However, the micro-level discrepancy, reported as the average absolute difference between the two sources, was large relative to the average amount of unemployment in each year, but significant only for reports of unemployment occurring in 1982.

In addition to studies which examine rates of unemployment, person-years of unemployment, or annual hours of unemployment, several empirical investigations have focused on spell-level information, examining reports of the specific spell and duration of the spell. Using the same data as presented in Duncan and Hill (1985), Mathiowetz and Duncan (1988) found that at the spell level, respondents failed to report over 60 percent of the individual spells. Levine (1993) found that between 35 percent and 60 percent of persons failed to report an unemployment spell one year after the event. Failure to report a spell of unemployment, in both studies, was related, in part, to the length of the unemployment spell; short spells of unemployment were subject to higher rates of underreporting.

The findings suggest that, similar to other types of discrete behaviors and events, the reporting of unemployment is subject to deterioration over time, although the fundamental factor affecting the quality of the reports may not be the passage of time alone but rather the greater complexity of the behavioral experience over longer recall periods. Both the micro-level comparisons as well as the comparisons of population estimates suggest that behavioral complexity interferes with the respondent’s ability to accurately report unemployment for distant recall periods. Hence we see greater underreporting among population subgroups who traditionally have looser ties to the labor force (teenagers, women). Although longer spells of unemployment were subject to lower levels of errors of omission, a finding that supports other empirical research with respect to the effects of salience, at least one study found that errors in reports of duration were negatively associated with the length of the spell. Whether this is indicative of an error in cognition or an indication of reluctance to report extremely long spells of unemployment (social desirability) is unresolved.

Sensitive Questions: Drug Use, Abortions

A large body of methodological evidence indicates that embarrassing or socially undesirable behaviors are misreported in surveys (e.g., Bradburn, 1983). For example, comparisons between estimates of the number of abortions based on survey data from the National Survey of Family Growth (NSFG) and estimates based on data collected from abortion clinics suggest that fewer than half of all abortions are reported in the NSFG (Jones and Forrest, 1992). Similarly, comparisons of survey reports of cigarette smoking with sales figures indicate significant underreporting on the part of household respondents, with the rate of underreporting increasing over time, a finding the author attributes to increasing social undesirability (Warner, 1978). Although one can find other examples in which survey reports of sensitive behaviors are compared to aggregate measures of the behavior, micro-level comparisons are rare.

Although validation studies of reports of sensitive behaviors are rare, there is a growing body of empirical literature which examines reports of sensitive behaviors as a function of mode of data collection, method of data collection, question wording, and context (e.g., Tourangeau and Smith, 1996). These studies have examined the reporting of abortions, AIDS risk behaviors, use of illegal drugs, and alcohol consumption. The hypothesis for these studies is that, given the tendency to underreport sensitive or undesirable behavior, the method or combination of essential survey design features which yield the highest estimate is the "better" measurement approach.

Studies comparing self-administered to interviewer-administered questions (either face to face or telephone) indicate that self-administration of sensitive questions increases levels of reporting relative to administration of the same question by an interviewer. Increases in the level of behavior have been reported in self-administered surveys (using paper and pencil questionnaires) concerning abortions (London and Williams, 1990), alcohol consumption (Aquilino and LoSciuto, 1990), and drug use (Aquilino, 1994). Similar increases in the level of reporting sensitive behaviors have been reported when the comparisons focus on the difference between interviewer-administered questionnaires and computer-assisted self-administered (CASI) questionnaires.

One of the major concerns with moving from an interviewer-administered questionnaire to self-administration is the problem of limiting participation to the literate population. Even among the literate population, the use of a self-administered questionnaire presents problems with respect to following directions (e.g., skip patterns). The use of audio computer assisted self-interviewing (ACASI) techniques circumvents both problems. The presentation of the questions in both written and auditory form (through headphones) preserves the privacy of a self-administered questionnaire without the restriction imposed by respondent literacy. The use of computers for the administration of the questionnaire eliminates two problems often seen in self-administered paper and pencil questionnaires--missing data and incorrectly followed skip patterns. A small but growing body of literature (e.g., O’Reilly, Hubbard, Lessler, Biemer, and Turner, 1994; Tourangeau and Smith, 1996) finds that ACASI methods are acceptable to respondents and appear to improve the reporting of sensitive behaviors. Cynamon and Camburn (1992) found that using portable cassette players rather than computers to administer questions (with the respondent recording answers on a paper form) was also effective in increasing reports of sensitive behaviors.

Methods for Reducing Measurement Error

As we consider means for reducing measurement error, we return to the theoretical frameworks which address the potential sources of error: those errors associated with problems of cognition and those resulting from issues associated with social desirability.

Repairs Focusing on Problems of Cognition

Comprehension. Of primary importance in constructing question items is to assure comprehension on the part of the respondent. Although the use of clear and easily understood language is a necessary step toward achieving that goal, simple language alone does not guarantee that the question is understood in the same manner by all respondents.

The literature examining comprehension problems in the design of income questions indicates that defining income constructs in a language easily understood by survey respondents is not easy (Moore, et al., 1999). Terms which most researchers would consider to be well understood by respondents may suffer from differential comprehension. For example, Stinson (1997) found significant diversity with respect to respondent’s interpretation of the term "total family income." Similarly, Bogen (1995) reported that respondents tend to omit sporadic self-employment and earnings from odd jobs or third or fourth jobs in their reports of income due to the respondents’ interpretation of the term "income."

Comprehension of survey questions is affected by several factors, including the length of the question, the syntactical complexity, the degree to which the question includes instructions such as inclusion and exclusion clauses, as well as the use of ambiguous terms. Consider, for example, the complexity of the following questions:

Example 1:

Since your welfare benefits ended in (FINAL BENEFIT MONTH), did you take part for at least one month in any Adult Basic Education (ABE) classes for improving your basic reading and math skills, or GED classes to help you prepare for the GED test, or classes to prepare for a regular high school diploma?

Example 2:

In (PRIOR MONTH), did you have any children of your own living in the household? Please include any foster or adopted children. Also include any grandchildren living with you.

Example 3:

Since (FINAL BENEFIT MONTH), have you worked for pay at a regular job at all? Please don’t count unpaid work experience, but do include any paid jobs, including paid community service jobs or paid on-the-job training.

Each of these items is cognitively complex. The first question requires the respondent to process three separate categories of education, determine whether the conditional phrase "at least one month" applies only to the adult basic education classes or also to the GED or regular high school classes, and also attribute a reason for attending ABE ("improving reading and math skills") or GED classes. Separating example 1 into three simple items, prefaced by an introductory statement concerning types of education, would make the task more manageable for the respondent. Examples 2 and 3 suffer from the problem of providing an exclusion or inclusion (or in the case of example 3, both) clause after the question. Both would be improved by defining for the respondent what the question concerns and then asking the question, so that the last thing the respondent hears is the question. Example 2 may be improved by simply asking separate questions concerning own children, foster children, and grandchildren.

With respect to question length, short questions are not always better. Cannell and colleagues (Cannell, Marquis, and Laurent, 1977; Cannell, Miller, and Oksenberg, 1981) demonstrated that longer questions providing redundant information can lead to increased comprehension, in part because the longer question provides additional context for responding as well as more time for the respondent to think about the question and formulate a response. On the other hand, longer questions which introduce new terms or become syntactically complex will result in lower levels of comprehension.

Comprehension can suffer from both lexical and structural ambiguities. For example, the sentence "John went to the bank" could be interpreted as John going to a financial institution or the side of a river. Lexical problems are inherent in a language in which words can have different interpretations. Although difficult to fix, interpretation can be aided through context and respondent’s usual use of the word (in this case, most likely the financial institution interpretation). Note that when constructing a question, one must consider regional and cultural differences in language and avoid terms which lack a clearly defined lexical meaning (e.g., "welfare reform"). Structural ambiguities arise when the same word can be used as different parts of speech--for example, as either a verb or an adjective in the sentence "Flying planes can be dangerous." Structural ambiguities can most often be repaired through careful wording of the question.

Questionnaire designers often attempt to improve comprehension by grouping questions so as to provide a context for a set of items, writing explicit questions, and, if possible, writing closed-ended items in which the response categories may aid in the interpretation of the question by the respondent. In addition, tailoring questions to accommodate the language of specific population subgroups is feasible with computer assisted interviewing systems.

Comprehension difficulties are best identified and repaired through the use of selected pretesting techniques such as cognitive interviewing or expert panel review (e.g., Presser and Blair, 1994; Forsyth and Lessler, 1991). Requesting respondents to paraphrase the question in their own words often provides insight into different interpretations of a question; similarly, the use of other cognitive interviewing techniques such as think-aloud interviews or the use of vignettes can be useful in identifying comprehension problems as well as in suggesting alternative wording options for the questionnaire designer.

Retrieval. Many of the questions of interest in surveying the welfare population request that the respondent report on retrospective behavior, often for periods covering several years or more (e.g., year of first receipt of AFDC benefits). Some of these questions require that the respondent date events of interest, thus requiring episodic retrieval of a specific event. Other questions request that respondents provide a numeric estimate (e.g., earnings from work last month); in these cases the respondent may rely on episodic retrieval (e.g., the most recent paycheck), reconstruction, an estimation strategy, or a combination of retrieval strategies to provide a response. As noted earlier, response strategies are often a function of the behavioral complexity experienced by the respondent; however, the strategy used by the respondent can be affected by the wording of the question.

Although responses based on episodic enumeration and responses based on estimation are both subject to measurement error, the literature suggests that the errors differ in kind. Questions which direct the respondent toward episodic enumeration tend to suffer from errors of omission (underreports) due to incomplete memory searches on the part of the respondent. Responses based on estimation strategies result in both inclusion and exclusion errors, resulting in greater variance but unbiased population estimates (Sudman, Bradburn, and Schwarz, 1996). The findings from Mathiowetz and Duncan (1988) illustrate the difference in reports based on estimation strategies as compared to episodic enumeration. In their study, population estimates of annual hours of unemployment for a two-year reference period based on respondents' reports of unemployment hours were reasonably accurate. In contrast, when respondents had to report the months and years of individual spells of unemployment (requiring episodic enumeration), over 60 percent of the individual spells of unemployment were not reported.

Several empirical investigations have identified means by which to improve the reporting of retrospective information for both episodic enumeration and estimation-based reports. These questionnaire design approaches include:

Event History Calendar. Work in the field of cognitive psychology has provided insight as to the structure of autobiographical information in memory. The research indicates that "certain types of autobiographical memories are thematically and temporally structured within an hierarchical ordering" (Belli, 1998). Event history calendars (Freedman et al., 1988) draw on both thematic and temporal information as a means for improving the reporting of what, when, and how often events happened. Whereas traditional survey instruments ask for retrospective reports through a set of discrete questions (e.g., "In what month and year did you last receive welfare payments?"), thereby emphasizing the discrete nature of events, event history calendars emphasize the relationship between events within broad thematic areas or life domains (work, living arrangements, marital status, child bearing and rearing). Major transitions within these domains, such as getting married or divorced, birth of a child, moves to a new house, or the start of a job, are identified by the respondent and recorded in such a way as to facilitate "an extensive use of autobiographical memory networks and multiple paths of memory associated with top-down, sequential, and parallel retrieval strategies" (Belli, 1998). If the question items of interest require the dating of several different types of events, the literature suggests that the use of event history calendars will lead to improved reporting. For example, event history calendars could prove to be beneficial in eliciting accurate responses to questions such as "What was the year and month that you first received welfare cash assistance as an adult?"

Landmark Events. The use of an event history calendar is most beneficial if the questionnaire focuses on the dating and sequencing of events and behaviors across several life domains. In some cases, the questionnaire contains a limited number of questions for which the respondent must provide a date or a correct sequence of events. In these cases, studies have indicated that the use of landmark dates can improve the quality of reporting by respondents (Loftus and Marburger, 1983). Landmark events are defined as either public or personal landmarks; for personal landmarks (such as a birthday or anniversary) the respondent can provide an accurate date, whereas public landmarks can be accurately dated by the researcher. Landmarks are effective for three reasons: (1) landmark dates make effective use of the cluster organization of memory; (2) landmark dates may convert a difficult absolute judgment of recency to an easier relative judgment; and (3) landmark dates may suggest to the respondent the need to pay attention to exact dates and not simply imprecise dates. One means by which to operationalize landmark dates is to begin the interview with the respondent noting personal and/or public landmark dates on a calendar which can be used for reference throughout the interview.

Use of Records. If the information has not been encoded in memory, the response quality will be poor no matter how well the questions have been constructed. For some information, the most efficient and effective means by which to improve the quality of the reported data is to have respondents access records. Several studies report an improvement in the quality of asset and income information when respondents used records (e.g., Maynes, 1968; Grondin and Michaud, 1994; Moore et al., 1996). Two factors often hinder questionnaire designers from asking respondents to use records: interviewers’ reluctance and mode of data collection. Although in some cases interviewers have been observed discouraging record use (Marquis, 1990), studies which request detailed income and expenditure information, such as the Survey of Income and Program Participation and the National Medical Expenditure Survey, have both reported success in encouraging respondents to use records (Moore et al., 1996). Record use by respondents is directly related to the extent to which interviewers have been trained to encourage their use by respondents. For telephone interviews, the fear is that encouraging record use may encourage nonresponse; a small body of empirical literature does not support this notion (Grondin and Michaud, 1994). One form of record to be considered is the prospective creation of a diary that is referenced by the respondent during a retrospective interview.

Recall vs. Recognition. Any free recall task, for example, the enumeration of all sources of income, is a cognitively more difficult task than the task of recognition, for example, asking the respondent to indicate which of a list of income sources is applicable to his or her situation. Consider the two approaches taken in examples 1 and 2:

Example 1:

In (PRIOR MONTH), did you receive any money or income from any other source? This might include (READ SLOWLY) unemployment insurance, workers’ compensation, alimony, rent from a tenant or boarder, an income tax refund, foster child payments, stipends from training programs, grandparents’ social security income and so on.

Example 2:

Next, I will read a list of benefit programs and types of support and I’d like you to tell me whether you or someone in your home gets this.

Food Stamps

Medicaid

Child care assistance

Child support from a child’s parent

Social Security

In the first example, the respondent must process all of the items together; most likely, after the first or second item on the list is read, the respondent fails to hear or process the remaining items on the list. Hence, the list does not provide an effective recognition mechanism. In the second example, the respondent is given time to process each item on the list individually (the entire list consists of 20 items).

Complex Behavioral Experience. Simple behavioral experiences are relatively easy to report even over long reference periods, whereas complex behavioral experiences can be quite difficult to reconstruct. For example, the experience of receiving welfare benefits continuously over a 12-month period is quite different from the experience of receiving benefits for eight of the 12 months. The use of filter questions to identify those for whom the behavioral experience is complex would permit the questionnaire designer to concentrate design efforts on those respondents for whom the task is most difficult. Those with complex behavioral experiences could be questioned using an event history calendar, whereas those for whom the recent past represents a steady state could be asked a limited number of discrete questions.
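In a computer assisted instrument, the filter described above could be expressed as a simple routing rule. The following is only a sketch under stated assumptions: the function name, the module labels, and the rule that "steady state" means benefits in none or all of the reference months are hypothetical choices made for illustration, not part of any existing instrument.

```python
def route_respondent(months_with_benefits, reference_months=12):
    """Pick a questionnaire module from the answer to a filter question.

    Assumption (illustrative only): a respondent who received benefits in
    none or in all of the reference months is treated as a steady state.
    """
    if not 0 <= months_with_benefits <= reference_months:
        raise ValueError("months_with_benefits must lie within the reference period")

    steady_state = months_with_benefits in (0, reference_months)
    # Steady-state respondents get a short set of discrete questions;
    # everyone else is routed to the event history calendar.
    return "discrete_questions" if steady_state else "event_history_calendar"


# Continuous receipt over 12 months versus receipt in 8 of the 12 months
print(route_respondent(12))
print(route_respondent(8))
```

A real filter would likely need more than one item (for example, the number of separate spells), since eight consecutive months is a simpler experience than eight scattered months.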

Recall Strategies. When respondents are asked to report a frequency or number of times an event or behavior occurred, they draw on different response strategies to formulate a response. The choice of response strategy is determined, in part, by the actual number or frequency as well as the regularity of the behavior. Rare or infrequent events are often retrieved through episodic enumeration in which the respondent attempts to retrieve each occurrence of the event. Such strategies are subject to errors of omission as well as misdating of the event by the respondent. When the event or behavior of interest occurs frequently, respondents will often use some form of estimation strategy to formulate a response. These strategies include rule-based estimation (recall a rate and apply it to the time frame of interest), automatic estimation (drawn from a sense of relative or absolute frequency), decomposition (estimate the parts and sum), normative expectations, or some form of heuristic, such as the availability heuristic (based on the speed of retrieval). All estimation approaches are subject to error, but a well-designed questionnaire can both suggest the strategy for the respondent to use and attempt to correct for the expected biases. For example, if the behavior or event of interest is expected to occur on a regular basis, a question which directs the respondent to retrieve the rule, apply the rule to the time frame of interest, and then probes to elicit exceptions to the rule may be a good strategy for eliciting a numeric response.
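The rule-plus-exceptions approach amounts to simple arithmetic once the respondent has supplied the pieces. The sketch below assumes a hypothetical question sequence that collects a usual weekly rate and then probes for occasions missed and occasions added; the function and parameter names are illustrative only.

```python
def rule_based_estimate(usual_per_week, weeks_in_period, missed=0, extra=0):
    """Apply the respondent's usual rate to the reference period,
    then adjust for the exceptions elicited by follow-up probes."""
    if min(usual_per_week, weeks_in_period, missed, extra) < 0:
        raise ValueError("all inputs must be non-negative")
    baseline = usual_per_week * weeks_in_period  # the rule applied to the period
    return baseline - missed + extra             # corrected for exceptions


# Hypothetical respondent: usually works 5 days a week, 4-week reference
# period, missed 2 days and worked 1 extra day.
days_worked = rule_based_estimate(usual_per_week=5, weeks_in_period=4, missed=2, extra=1)
print(days_worked)
```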

Current vs. Retrospective Reports. Current status is most often easier to report, with respect to cognitive difficulty, than retrospective status, so it is often useful to begin with questions concerning current status. Information retrieved as part of the reporting of current status will also facilitate retrieval of retrospective information.

Repairs Focusing on Problems Related to Social Desirability

Questions for which the source of the measurement error is related to the perceived sensitivity of the items or the socially undesirable nature of the response often call for the use of question items or questionnaire modes which provide the respondent a greater sense of confidentiality or even anonymity as a means for improving response quality. The questionnaire designer must gauge the level of sensitivity or threat (or elicit information on sensitivity or threat through developmental interviews or focus groups) and respond with the appropriate level of questionnaire modifications. The discussion that follows attempts to provide approaches for questions of varying degrees of sensitivity, moving from slightly sensitive to extremely sensitive or illegal behaviors.

Reducing Threat through Question Wording. Sudman and Bradburn (1982) provide a checklist of question approaches to minimize threat from sensitive questions. Among the suggestions made by the authors are the use of open questions as opposed to closed questions (so as to not reveal extreme response categories), the use of longer questions so as to provide context and indicate that the subject is not taboo, the use of alternative terminology (e.g., street language for illicit drugs), and embedding the topic in a list of more threatening topics to reduce perceived threat, since threat or sensitivity is determined, in part, by the context.

Alternative Modes of Data Collection. For sensitive questions, one of the most consistent findings from the experimental literature indicates that the use of self-administered questionnaires results in higher reports of threatening behavior. For example, in studies of illicit drug use, the increase in reports of use was directly related to the perceived level of sensitivity: greatest for the reporting of recent cocaine use, less profound but still significant with respect to marijuana and alcohol use. Alternative modes could involve the administration of the questions by an interviewer with the respondent completing the response categories using paper and pencil, or administration of the questionnaire through a portable cassette and self-recording of responses. More recently, face to face data collection efforts have experimented with computer assisted self-administration (CASI), in which the respondent reads the questions from the computer screen and directly enters the responses, and audio CASI, in which the questions can be heard over headphones as well as read by the respondent. The latter has the benefit of not requiring the respondent to be literate and can be programmed to permit efficient multilingual administration without requiring multilingual survey interviewers. In addition, both computer-assisted approaches offer the advantage that complicated skip patterns, not possible with paper and pencil self-administered questionnaires, can be incorporated into the questionnaire. Similar methods are possible in telephone surveys, with the use of push-button or voice recognition technology for the self-administered portion of the questionnaire.

Randomized Response and Item Count Techniques. Two techniques described in the literature provide researchers with a means of obtaining a population estimate of an event or behavior but not information which can be associated with the individual. Both were initially designed for use in face to face surveys; it is feasible to administer an item count approach in a telephone or self-administered questionnaire. The randomized response technique is one in which two questions are presented to the respondent, each with the same response categories, usually yes and no. One question is the question of interest; the other is a question for which the distribution of the responses for the population is known. Each question is associated with a different color. A randomizing device, for example, a box containing beads of different colors, indicates to the respondent which of the questions to answer, for which he or she simply states to the interviewer either "yes" or "no." The probability of selecting a red bead as opposed to a blue bead is known to the researcher. An example will illustrate. A box contains 100 beads, 70 percent of which are red and 30 percent of which are blue. When shaken, the box will present to the respondent one bead (seen only by the respondent). Depending upon the color, the respondent will answer one of the following questions: (Red question) Have you ever had an abortion? and (Blue question) Is your birthday in June? In a survey of 1,000 individuals, the expected number of persons answering "yes" to the question about the month of the birthday is approximately 1000(.30)/12 or 25 persons (assuming birthdays are equally distributed over the twelve months of the year). If 200 persons said "yes" in response to either the red or the blue question, then approximately 175 answered yes in response to the abortion item, yielding a population estimate of the percent of women having had an abortion of 175/(1000*.70) or 25 percent.
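The arithmetic of the bead example can be written as a small function. This is a minimal sketch, not a full treatment of the technique: the function and parameter names are illustrative, and it gives only the point estimate, with no allowance for sampling variance.

```python
def randomized_response_estimate(n_respondents, n_yes, p_sensitive, p_yes_innocuous):
    """Point estimate of the share answering "yes" to the sensitive question.

    n_respondents   -- total number of respondents
    n_yes           -- total "yes" answers, whichever question was answered
    p_sensitive     -- probability the device selects the sensitive question
    p_yes_innocuous -- known share of "yes" answers to the innocuous question
    """
    if not 0 < p_sensitive <= 1:
        raise ValueError("p_sensitive must be in (0, 1]")
    # Expected "yes" answers contributed by the innocuous (blue) question
    expected_innocuous_yes = n_respondents * (1 - p_sensitive) * p_yes_innocuous
    # The remaining "yes" answers are attributed to the sensitive (red) question
    sensitive_yes = n_yes - expected_innocuous_yes
    return sensitive_yes / (n_respondents * p_sensitive)


# Values from the example in the text: 1,000 respondents, 200 "yes" answers,
# 70 percent red beads, birthdays assumed uniform over twelve months.
estimate = randomized_response_estimate(1000, 200, 0.70, 1 / 12)
print(f"{estimate:.1%}")
```

With the inputs from the text, the function reproduces the 25 percent figure.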

The item count method is somewhat easier to administer than the randomized response technique. In the item count method, two almost identical lists of behaviors are developed; in one list k behaviors are listed and in the other list k + 1 behaviors are listed, where the additional item is the behavior of interest. Half of the respondents are administered the list with k items; the other half are offered the list with k + 1 items. Respondents are asked simply to provide the number of behaviors in which they have engaged (without indicating the specific behaviors). The difference between the two half-samples in the mean number of behaviors reported provides the estimate of the prevalence of the behavior of interest.
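The item count estimate is a difference between two group means, as the following sketch shows. The reported counts are made up purely for illustration, and the function name is hypothetical; a real application would also need a variance estimate, which is omitted here.

```python
def item_count_estimate(counts_k_list, counts_k_plus_1_list):
    """Estimated prevalence of the behavior of interest.

    counts_k_list        -- counts reported by the half-sample given k items
    counts_k_plus_1_list -- counts reported by the half-sample given k + 1 items
    """
    if not counts_k_list or not counts_k_plus_1_list:
        raise ValueError("both half-samples must contain at least one report")
    mean_k = sum(counts_k_list) / len(counts_k_list)
    mean_k_plus_1 = sum(counts_k_plus_1_list) / len(counts_k_plus_1_list)
    # The extra item is the only systematic difference between the lists
    return mean_k_plus_1 - mean_k


# Invented reports from two small half-samples (illustration only)
short_list_counts = [1, 2, 0, 3, 2]  # mean 1.6
long_list_counts = [2, 2, 1, 3, 2]   # mean 2.0
print(round(item_count_estimate(short_list_counts, long_list_counts), 2))
```

Because each respondent reports only a total, no individual answer to the sensitive item is revealed, which is the source of the limitation discussed next.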

The major disadvantage of either the randomized response technique or the item count method is that one cannot relate individual characteristics of the respondents to the behavior of interest; rather, one is limited to a population estimate.

Conclusions

The empirical literature provides evidence of both reasonably accurate reporting of earnings, other sources of income, and employment, as well as extremely poor reporting on the part of household respondents. The magnitude of measurement error in these reports is in part a function of the task as framed by the question; for example, we see evidence of very different quality of reports of rehearsed information (e.g., annual earnings) or those requiring estimation (e.g., annual hours of unemployment) as compared to reports requiring the retrieval of episodic information (reporting and dating of individual spells of unemployment). Careful questionnaire construction and thorough testing of questions and questionnaires can effectively identify question problems and reduce sources of error.

More difficult to address in the design of the questionnaire are those errors that arise as a function of the complexity of the behavioral experience of the respondent. However, even here, there are options for the questionnaire designer to consider. Although some of these methods, for example the use of event history calendars, may come at a cost of increased administration time, the increase in data quality may easily justify the costs. The ever-present dilemma, that is, the tradeoff between survey costs and survey errors, is of even greater concern when surveying the population of welfare recipients.

FOOTNOTES

1. Note that although statistical and economic theory provide the foundation for analysis of error-prone data, these disciplines provide little theoretical foundation for understanding the source of the measurement error nor the means for reducing measurement error. The discussion presented here will be limited to a review of cognitive and social psychological theory applicable to the measures of interest in understanding the welfare population.

2. Note that the study was conducted in England.

3. The CPS is collected each month from a probability sample of approximately 50,000 households; interviews are conducted during the week of the month containing the 19th day of the month and respondents are questioned concerning labor force status for the previous week, Sunday through Saturday, which includes the 12th of the month. In this way, the data are considered the respondent’s current employment status, with a fixed reference period for all respondents, regardless of which day of the week they are interviewed. In addition to the core set of questions concerning labor force participation and demographic characteristics, respondents interviewed in March of any year are asked a supplemental set of questions (hence the name March supplement) concerning income recipiency and amounts, weeks employed, unemployed and not in the labor force, and health insurance coverage for the previous calendar year.

References

Aquilino, W. (1994) "Interview Mode Effects in Surveys of Drug and Alcohol Use." Public Opinion Quarterly, 58: 210-240.

Aquilino, W. and LoSciuto, L. (1990) "Effect of Interview Mode on Self-Reported Drug Use." Public Opinion Quarterly, 54: 362-395.

Barron, J., Berger, M., and Black, D. (1997) On the Job Training. Kalamazoo, MI: W.E. Upjohn Institute for Employment Research.

Belli, R. (1998) "The Structure of Autobiographical Memory and the Event History Calendar: Potential Improvements in the Quality of Retrospective Reports in Surveys." Memory, 6(4): 383-406.

Belsen, T. (1981) The Design and Understanding of Survey Questions. Aldershot, England: Gower Publishing Company.

Bogen, K. (1995) "Results of the Third Round of SIPP Cognitive Interviews." Unpublished manuscript, U.S. Bureau of the Census.

Borus, M. (1966) "Response Error in Survey Reports of Earnings Information." Journal of the American Statistical Association, 61: 729-738.

Borus, M. (1970) "Response Error and Questioning Technique in Surveys of Earnings Information." Journal of the American Statistical Association, 65: 566-575.

Bound, J. and Krueger, A. (1991) "The Extent of Measurement Error in Longitudinal Earnings Data: Do Two Wrongs Make a Right?" Journal of Labor Economics, 9: 1-24.

Bradburn, N. (1983) "Response Effects." in P. Rossi, J. Wright, and A. Anderson (eds.) Handbook of Survey Research. New York: Academic Press.

Burton, S. and Blair, E. (1991) "Task Conditions, Response Formation Processes, and Response Accuracy for Behavioral Frequency Questions in Surveys." Public Opinion Quarterly, 55:50-79.

Cannell, C., Fisher, and Bakker (1965) "Reporting of Hospitalization in the Health Interview Survey." Vital and Health Statistics, Series 2, No. 6. Washington, D.C.: U.S. Government Printing Office.

Cannell, C., Marquis, K., and Laurent, A. (1977) "A Summary of Studies of Interviewing Methodology." Vital and Health Statistics, Series 2, No. 69. Washington, D.C.: U.S. Government Printing Office.

Cannell, C., Miller, P. and Oksenberg, L. (1981) "Research on Interviewing Techniques." in S. Leinhardt (ed.) Sociological Methodology. San Francisco: Jossey-Bass.

Carstensen, L. and Woltman, H. (1979) "Comparing Earnings Data from the CPS and Employers Records." Proceedings of the Section on Social Statistics. Alexandria, Va: American Statistical Association.

Cash, W. and Moss, A. (1969) "Optimum Recall Period for Reporting Persons Injured in Motor Vehicle Accidents." Vital and Health Statistics, Series 2, No. 50. Washington, D.C.: U.S. Government Printing Office.

Crowder, R. (1976) Principles of Learning and Memory. Hillsdale, NJ: Lawrence Erlbaum Associates.

Cynamon, M. and Camburn, D. (1992) "Employing a New Technique to Ask Questions on Sensitive Topics." paper presented at the annual meeting of the National Field Directors Conference, St. Petersburg, FL.

David, M. (1962) "The Validity of Income Reported by a Sample of Families Who Received Welfare Assistance During 1959." Journal of the American Statistical Association, 57: 680-685.

Dibbs, R., Hale, A., Loverock, R., and Michaud, S. (1995) Some Effects of Computer Assisted Interviewing on the Data Quality of the Survey of Labour and Income Dynamics. Statistics Canada: SLID Research Paper Series, No. 95-07.

Dodge, R. (1970) "Victim Recall Pretest," Unpublished Memorandum, Washington, D.C.: U.S. Bureau of the Census. [Cited in R. Groves (1989)]

Duncan, G. and Hill, D. (1985) "An Investigation of the Extent and Consequences of Measurement Error in Labor-Economic Survey Data." Journal of Labor Economics, 3: 508-532.

Forsyth, B. and Lessler, J. (1991) "Cognitive Laboratory Methods: A Taxonomy" in P. Biemer, R. Groves, L. Lyberg, N. Mathiowetz, and S. Sudman (eds.) Measurement Error in Surveys. New York: John Wiley and Sons.

Freedman, D., Thornton, A., Camburn, D., Alwin, D. and Young-DeMarco, L. (1988) "The Life History Calendar: A Technique for Collecting Retrospective Data." in C. Clogg (ed.) Sociological Methodology. San Francisco: Jossey-Bass.

Gems, B., Gosh, D., and Hitlin, R. (1982) "A Recall Experiment: Impact of Time on Recall of Recreational Fishing Trips." Proceedings of the Section on Survey Research Methods. Alexandria, VA: American Statistical Association.

Goodreau, K., Oberheu, H., and Vaughan, D. (1984) "An Assessment of the Quality of Survey Reports of Income from the Aid to Families with Dependent Children (AFDC) Program." Journal of Business and Economic Statistics, 2: 179-186.

Grondin, C. and Michaud, S. (1994) "Data Quality of Income Data Using Computer Assisted Interview: The Experience of the Canadian Survey of Labour and Income Dynamics." Proceedings of the Section on Survey Research Methods. Alexandria, VA: American Statistical Association.

Groves, R. (1989) Survey Costs and Survey Errors. New York: John Wiley and Sons.

Halsey, H. (1978) "Validating Income Data: Lessons from the Seattle and Denver Income Maintenance Experiment." Proceedings of the Survey of Income and Program Participation Workshop-Survey Research Issues in Income Measurement: Field Techniques, Questionnaire Design and Income Validation. Washington, D.C.: Department of Health, Education, and Welfare.

Hansen, M., Hurwitz, W., and Bershad, M. (1961) "Measurement Errors in Censuses and Surveys," Bulletin of the International Statistical Institute, 38: 359-374.

Hoaglin, D. (1978) "Household Income and Income Reporting Error in the Housing Allowance Demand Experiment." Proceedings of the Survey of Income and Program Participation Workshop-Survey Research Issues in Income Measurement: Field Techniques, Questionnaire Design and Income Validation. Washington, D.C.: Department of Health, Education, and Welfare.

Horvath, F. (1982) "Forgotten Unemployment: Recall Bias in Retrospective Data." Monthly Labor Review, 105: 40-43.

Jones, E. and Forrest, J. (1992) "Underreporting of Abortions in Surveys of U.S. Women: 1976 to 1988." Demography, 29: 113-126.

Kish, L. (1965) Survey Sampling. New York: John Wiley and Sons.

Livingston, R. (1969) "Evaluation of the Reporting of Public Assistance Income in the Special Census of Dane County, Wisconsin: May 15, 1968." Proceedings of the Ninth Workshop on Public Welfare Research and Statistics.

Levine, P. (1993) "CPS Contemporaneous and Retrospective Unemployment Compared." Monthly Labor Review, 116: 33-39.

Loftus, E. and Marburger, W. (1983) "Since the Eruption of Mt. St. Helens, Has Anyone Beaten You Up? Improving the Accuracy of Retrospective Reports with Landmark Events." Memory and Cognition, 11: 114-120.

London, K. and Williams, L. (1990) "A Comparison of Abortion Underreporting in an In-Person Interview and Self-Administered Question." paper presented at the annual meeting of the Population Association of America, Toronto.

Lyberg, L. and Kasprzyk, D. (1991) "Data Collection Methods and Measurement Error: An Overview." in P. Biemer, R. Groves, L. Lyberg, N. Mathiowetz, and S. Sudman (eds.) Measurement Error in Surveys. New York: John Wiley and Sons.

Marquis, K. and Moore, J. (1990) "Measurement Errors in SIPP Program Reports." Proceedings of the Annual Research Conference, Washington, D.C.: U.S. Bureau of the Census.

Mathiowetz, N. (1986) "The Problem of Omissions and Telescoping Error: New Evidence from a Study of Unemployment." Proceedings of the Section on Survey Research Methods. Alexandria, VA: American Statistical Association.

Mathiowetz, N. and Duncan, G. (1988) "Out of Work, Out of Mind: Response Error in Retrospective Reports of Unemployment." Journal of Business and Economic Statistics, 6: 221-229.

Maynes, E. (1968) "Minimizing Response Errors in Financial Data: The Possibilities." Journal of the American Statistical Association, 63: 214-227.

Mellow, W. and Sider, H. (1983) "Accuracy of Response in Labor Market Surveys: Evidence and Implications," Journal of Labor Economics 1: 331-344.

Menon, G. (1993) "The Effects of Accessibility of Information in Memory on Judgments of Behavioral Frequencies." Journal of Consumer Research, 20: 431-440.

Menon, G. (1994) "Judgments of Behavioral Frequencies: Memory Search and Retrieval Strategies." in N. Schwarz and S. Sudman (eds.) Autobiographical Memory and the Validity of Retrospective Reports. New York: Springer-Verlag.

Moore, J., Marquis, K. and Bogen, K. (1996) The SIPP Cognitive Research Evaluation Experiment: Basic Results and Documentation. Unpublished U.S. Bureau of the Census Report.

Moore, J., Stinson, L., and Welniak, E. (1999) "Income Reporting in Surveys: Cognitive Issues and Measurement Error." in M. Sirken, D. Herrmann, S. Schechter, N. Schwarz, J. Tanur, and R. Tourangeau (eds.) Cognition and Survey Research. New York: John Wiley and Sons.

Morganstern, R. and Bartlett, N. (1974) "The Retrospective Bias in Unemployment Reporting by Sex, Race, and Age." Journal of the American Statistical Association, 69: 355-357.

Murphy, L. and Cowan, C. (1976) "Effects of Bounding on Telescoping in the National Crime Survey." Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association.

Neter, J. and Waksberg, J. (1964) "A Study of Response Errors in Expenditure Data from Household Interviews." Journal of the American Statistical Association, 59: 18-55.

Oberheau, H. and Ono, M. (1975) "Findings from a Pilot Study of Current and Potential Public Assistance Recipients Included in the Current Population Survey." Proceedings of the Section on Social Statistics. Alexandria, VA: American Statistical Association.

O’Reilly, J., Hubbard, M., Lessler, J., Biemer, P., and Turner, C. (1994) "Audio and Video Computer Assisted Self-Interviewing: Preliminary Tests of New Technology for Data Collection." Journal of Official Statistics, 10: 197-214.

Presser, S. and Blair, J. (1994) "Survey Pretesting: Do Different Methods Produce Different Results?" Sociological Methodology. San Francisco: Jossey-Bass.

Rodgers, W., Brown, C., and Duncan, G. (1993) "Errors in Survey Reports of Earnings, Hours Worked, and Hourly Wages." Journal of the American Statistical Association, 88: 1208-1218.

Schaeffer, N. (1994) "Errors of Experience: Response Errors in Reports about Child Support and Their Implications for Questionnaire Design." in N. Schwarz and S. Sudman (eds.) Autobiographical Memory and the Validity of Retrospective Reports. New York: Springer-Verlag.

Stinson, L. (1997) "The Subjective Assessment of Income and Expenses: Cognitive Test Results." Unpublished manuscript, U.S. Bureau of Labor Statistics.

Sudman, S. and Bradburn, N. (1973) "Effects of Time and Memory Factors on Response in Surveys." Journal of the American Statistical Association, 68: 805-815.

Sudman, S. and Bradburn, N. (1982) Asking Questions: A Practical Guide to Questionnaire Design. San Francisco: Jossey-Bass.

Sudman, S., Bradburn, N., and Schwarz, N. (1996) Thinking About Answers: The Application of Cognitive Processes to Survey Methodology. San Francisco: Jossey-Bass.

Tourangeau, R. (1984) "Cognitive Sciences and Survey Methods." in T. Jabine, M. Straf, J. Tanur, and R. Tourangeau (eds.) Cognitive Aspects of Survey Methodology: Building a Bridge Between Disciplines. Washington, D. C.: National Academy Press.

Tourangeau, R. and Smith, T. (1996) "Asking Sensitive Questions: The Impact of Data Collection Mode, Question Format, and Question Context." Public Opinion Quarterly, 60: 275-304.

Tourangeau, R., Rips, L, and Rasinski, K (forthcoming) The Psychology of Survey Response. Cambridge: Cambridge University Press.

U.S. Bureau of the Census (1979) "Vocational School Experience: October, 1976." Current Population Reports Series P-70, No. 343. Washington, D.C.: U.S. Government Printing Office.

Yen, W. and Nelson, H. (1996) "Testing the Validity of Public Assistance Surveys with Administrative Records: A Validation Study of Welfare Survey Data." paper presented at the Annual Conference of the American Association for Public Opinion Research.

Warner, K. (1978) "Possible Increases in the Underreporting of Cigarette Consumption." Journal of the American Statistical Association, 73: 314-318.
