Models for Multi-Level State Science Assessment Systems
Edys Quellmalz
SRI International
and
Mark Moody
August 6, 2004
Paper commissioned by the Committee on Test Design for K-12 Science Achievement
Center for Education
National Research Council
The seminal discussion of the Classroom-Focused model occurred at the March, 2002 meeting of the Maryland Department of Education’s Psychometric Council. We gratefully acknowledge the contributions of Bert Green, PhD, Emeritus, Johns Hopkins University; Huynh Huynh, PhD, University of South Carolina; William D. Schafer, PhD, Emeritus, University of Maryland; and Steven Wise, PhD, James Madison University. In addition, Robert W. Lissitz, PhD, University of Maryland and Lawrence M. Rudner, PhD, Graduate Management Admissions Council contributed to the lively discussion.
Contributions to reviews of science and assessment research and practice and the SCALE Tech model were provided by Geneva Haertel, Angela Haydel, and Patty Kreikemeier, of the Center for Technology in Learning at SRI International.
Models for Multi-Level State Science Assessment Systems
Introduction
Under the requirements of the No Child Left Behind (NCLB) legislation, assessment of student achievement is designed as an accountability tool to help ensure that all students are appropriately instructed and achieve proficiency on state defined standards. This purpose requires assessment instruments that adequately represent the content and strategies set forth in standards and that accurately measure individual performance. The aggregation of individual results on the basis of subgroup membership is the primary accountability measure under NCLB. A major challenge is to provide meaningful information to educators to inform instructional improvement at the program and classroom level. In addition, the results should inform students and their parents of the student’s individual achievement level.
Purpose
This report was commissioned by the National Research Council Committee on Test Design for K-12 Science to create models for a state system of science assessments. Our design team was asked to describe a set of strategies that states might consider for the design of their science assessment systems. Our report, “Models for Multi-Level State Science Assessment Systems,” addresses the requirements of NCLB and, at the same time, offers approaches for designing rigorous, sound, coherent state science assessment systems that harness the resources of state collaboration and the power of technology. The models aim to articulate and integrate science assessment methods across the classroom, district, and state levels of the system. The report describes core features of coherent, integrated, multi-level assessment systems that could be combined in numerous ways. We propose two models that could be built incrementally as they fit within state policies and resources. Stages of model development may be influenced by: 1) the extent and nature of collaboration with other states to leverage resources, 2) the breadth and sophistication of technology supports, and 3) the uses of the assessment results at different levels of the system.
The models are based on analyses of research and practice on assessment and science learning. Our sources included publications, conference presentations, Web sites, and interviews with assessment and collaborative program personnel.
Issues in the Assessment of Science
Limitations of Current Assessment Practice. National professional organizations have led the way for states to set challenging standards for what students should know and be able to do in science (NSES, 1999; AAAS, 1993). These standards call for deep conceptual understanding and extended inquiry. However, it is widely recognized that progress on challenging standards is not being documented by current testing practices. Traditional tests do not reflect enriched views of learning based on contemporary cognitive research (Pellegrino, Chudowsky & Glaser, 2001; Bransford, Brown & Cocking, 2000). Large-scale tests typically favor breadth of content coverage over depth of reasoning and employ decontextualized items, primarily using the multiple-choice format.
A recent study of the alignment of three prominent science reference exams revealed uneven, scant coverage of most of the National Science Education Standards (NSES) for Science as Inquiry (Quellmalz & Kreikemeier, 2004). Within those exams, the only inquiry standard judged to be tested by the multiple choice format related to use of mathematics. Clearly, the deep learning and complex performances needed to provide evidence of achievement of challenging science standards will require use of multiple formats, including constructed response items and performance assessments.
Contributions of Performance Assessments. Performance assessments are seen by some reform advocates as approaches that can provide strong evidence of advances in student learning (Pellegrino, Chudowsky, & Glaser, 2001; Glaser & Silver, 1994; Quellmalz & Zalles, 2002). In performance assessments, students explain how they plan and conduct experiments, gather and organize data, analyze and interpret findings, and communicate results and conclusions. Researchers make the point that performance assessments based on theories of knowledge development can make cognitive activity and effort visible, therefore serving as catalysts for constructive teaching by providing opportunities for reasoning to be examined and questioned. Furthermore, technology-supported assessments can present complex, authentic performance tasks well suited to measuring such outcomes, as well as students’ growing prowess with science and math learning tools.
Understanding the central role that performance assessment plays in standards-based reform, educators seek ways to use these assessments to test student learning. Experience indicates, however, that the level of effort and costs are very high to develop performance assessment tasks and scoring rubrics, train raters to score reliably, and ensure technical quality through field testing and expert review (Pellegrino, Chudowsky, & Glaser, 2001; CPRE, 1995; Quellmalz, 1984). With the economic constraints and logistical demands currently facing state assessment programs, finding ways to incorporate a range of assessment formats, particularly the development and use of constructed response and performance assessment formats for accountability, presents a significant challenge.
Need for Professional Development on Science Assessment. Most large-scale assessments have ambiguous relationships to the day-to-day business of teachers and students. As a consequence, large-scale accountability assessments contribute little to the quality of instruction and student learning, and can, in fact, narrow curricula to the factual information many tests emphasize. Formative classroom assessments, on the other hand, have had impressive, positive effects on student learning (Black & Wiliam, 1998).
Classroom teachers, who bear the burden of the misalignment of standards and prevailing assessment methods, need support in the alignment, use, customization, and development of sensitive assessment approaches that will reflect student growth in rigorous science curricula. In general, assessment literacy is sadly lacking in teacher education pre-service courses and in professional development for classroom teachers (Stiggins, 2002).
The Role of Technology Supports for Science Assessment. The powerful capabilities of technologies can be harnessed to reform the assessment practices used to document that state science standards have been met. As technologies revolutionize so many facets of the way we work and live, innovative applications are making increasingly significant contributions to student learning and assessment. The affordances of technology hold great promise for addressing many of the challenges for assessing deep science learning. Technology tools can support assessment functions such as system design, alignment of assessments with standards, test authoring and assembly, access to collections of exemplary assessments, online administration in conventional or adaptive modes, automated scoring, technology-based rater training and scoring, data analysis, psychometric analysis, and tailored reporting for multiple audiences. Dramatic reductions in the costs and logistics of paper-and-pencil testing are making technology-based assessments an attractive alternative to traditional print forms (Bennett, 2004; Neuberger, 2004).
Moreover, technologies have fundamentally transformed the field of science and are playing an increasingly central role in inquiry-based science education. Students can access vast reservoirs of digital information and data, employ powerful analysis tools, and use a range of representational forms to understand and communicate. Increasingly, the tools of researchers and scientists are being made accessible to K-12 students. The use of technology expands both the nature of the content that can be presented in an assessment and the knowledge, skills, and cognitive processes that can be assessed (Pellegrino et al., 2001; Means, Penuel, & Quellmalz, 2001). Many of these technologies can be used to elicit, collect, document, analyze, appraise, and display kinds of student performances that have not been readily accessible through traditional testing formats. Technologies can present complex system environments such as local and global ecosystems and probe dynamic inquiries as students test and revise their predictions within modeling tasks. Technologies can support fundamental changes in the range of student outcomes that can be assessed as well as how the outcomes are tested.
Risks of Limited Forms of Assessments. The design of sound state science assessment systems will need to address the limitations of typical large-scale assessment practices. Restricted forms of assessment can result in a range of negative, unintended consequences. The state may not have access to data on students’ achievement on the full range of content and inquiry called for in state standards. Administrators may not have data relevant for informing decisions about use of science curricula likely to promote deep science understanding and processes. Professional development programs may not have access to data that can identify areas needed for strengthening pedagogical strategies. Most importantly, teachers and students may not have access to assessment data and items and tasks that provide feedback on progress. Thus, to collect data on the full range of challenging science standards targeted, states will need to incorporate and connect a range of assessment formats and also to articulate testing at multiple levels of the educational system.
Overview of Features of Multi-level Science Assessment System Designs
The models of multi-level state science assessment systems described in this report share a set of fundamental components.
1. The assessment systems leverage the power of collaboration to share the costs and logistics of fully developing, maintaining, and articulating science assessments.
2. The systems develop explicit plans for connecting and integrating assessment designs and results gathered from multiple levels of the system.
3. The systems place an emphasis on the use of science assessment for learning, i.e., for diagnosis at the classroom level.
4. The assessment systems draw upon the capabilities of technologies to support assessment design, administration, scoring, interpretation, and assessment literacy.
5. The science assessment systems document alignment of current and planned assessment tasks and items with content and performance standards. These standards are further specified to define the knowledge, skills, and strategies and levels of performance that comprise the assessment targets of a student outcome model.
6. The systems employ common task and item specifications to shape pools of tasks and items that can be accessed for assessments at multiple levels of the system and that will elicit and link evidence of achievement of science standards.
7. The systems develop strategies for sampling from the collections to build and connect test forms at different levels of the system.
8. The science assessment systems design and implement professional development on assessment, item and task development, administration, and use of results.
Below, we describe each of the components in greater detail.
Collaborators with shared goals and synergistic levels of effort. The models presented in this report assume that several states seek to develop or enhance their state science assessment system, but lack the resources to do so on their own. Assessment collaboratives offer the opportunity for sharing not only costs, but also labor and expertise. Such state assessment collaboratives for science can learn from the experience of other assessment collaboratives. The assessment collaboratives have encountered a number of challenges that will need to be addressed by new efforts (Fabrizio, 2004; Barth, 2004). First, collaborative leadership and management need to be established. Options may include contracting with an external organization or assigning state assessment staff to lead and manage the development, field testing, and technical quality procedures. A central issue will be to reach an agreement on the science standards, framework, or objectives for assessments to be developed. Coalition members must decide upon commonly shared assessment development priorities and shared descriptions of science standards and component skills. Another challenge relates to decisions about the formats of tasks and items to be developed and the particular specification shells to guide the development. Each collaborative must prepare policies for use of the assessments by members and non-members, particularly the policies for parts of the collections that will be secure and those that will be released. In addition, collaboratives must agree upon member funding requirements, contracts for the preparation of related assessment materials, expectations for sharing the assessments with non-members and for maintaining and updating the shared assessments.
Vertical articulation of assessment system components. The multi-level models in this report rest on the premise that the articulation of assessments is based on alignment of assessments at each level with national and state science standards and on shared design specifications for assessment items and tasks that will be used at the various system levels. Therefore, in-depth probes of student conceptual understanding and inquiry in science at the classroom level will share types of tasks and items parallel to those employed in district and state testing. In general, to serve formative, diagnostic purposes, classroom assessments would incorporate more items and complex assessment tasks for science standards than would be in test forms developed for district and state summative, accountability purposes. However, each level would employ science assessment items and tasks that are common or parallel to those used at the other levels. The construction of assessment forms would vary at each level of the educational system, as would expectations for interpretation and use of the assessment results. Figure 1 presents a conceptual framework to illustrate the relationships among assessment system levels, item formats, and utility of assessment information for decisions. In the Item Format columns, the more filled the circle, the more prevalent the use of that format at that level. In the Information Utility columns, the more filled the circle, the more useful data from assessments at that level would be for decisions about classroom students and teachers, programs, or policy. Thus, at the classroom level, assessments should consist of a greater proportion of constructed responses (CR) and performance assessments (PA) in relation to selected responses (SR), as these item formats elicit rich student responses that allow teachers to better understand each student’s strengths and needs.
This information is very useful for student-level decision making, moderately useful for program-level decisions, and less critical for informing policy. As the level of aggregation increases from the classroom to the state, the relative utility of the in-depth student information provided by constructed responses and performance assessments for decisions about programs and policies decreases, while the utility of less sensitive selected response item and task formats for program and policy information increases. At the same time, rich, diagnostic information elicited by constructed responses is far less prevalent in tests more distal to the classroom, due to a greater reliance on data obtained from selected response items that cover more of the domain in less depth. In the multi-level articulated model, the unit of analysis at each level is designed to maximize the usefulness of the results for the information needs at that level. As discussed earlier, even state-level science assessment must balance the efficiency and economy of relying on selected response formats against the risk of narrowing the emphases of professional development and curriculum policy decisions. The multi-level models presented in this report attempt to offer strategies for balancing testing costs, multiple formats, and information utility.

Figure 1. Item Formats and Information Utility
Alignment of assessments with science standards. The collaborative members will need to identify assessment development priorities related to the NSES. Most state standards have been aligned to the NSES, allowing the NSES to serve as a general, common framework for discussions of science outcomes needing development by the collaborative. The state science standards will be the backbone connecting the construction of multi-level assessment test forms within a state. Since both NSES and state standards may be quite broad, the agreed-upon standards will need further delineation of component concepts and processes to target in task specifications.
Development and maintenance of pools of quality items/tasks. A primary task of collaboratives will be to assemble and develop banks or collections of science tasks and items that can be accessed and incorporated into assessments at different levels of the system. The collaboratives will need to collect and analyze the items and tasks the states separately and collectively have on hand, align them to the NSES, agree upon the types of items and tasks desired for eliciting evidence of achievement of the standards, and decide on the numbers of tasks and items necessary to stock and replenish the pools. An initial needs analysis will inform the priorities for joint development and decisions about the work plan for developing the collections/item banks. The assessment development process would follow established methodologies for item and test preparation and technical quality (AERA/APA/NCME, 2002).
Specifications for accountability and classroom tests. The development effort will need to include plans for assembling items/tasks from the pools into test forms that yield valid and reliable data on achievement for uses of the data at different levels of the system. Common task specification shells will ensure that tasks and items in the secure and released collections are parallel. Tasks and items will move from the secure to the released pool as assessments are exposed after multiple administrations. The strategy for releasing items would need to be determined by the collaborative leadership. For example, one strategy is to release the equivalent of one form per state per year. Another could be to build a large secure item pool so that items sit out for a couple of years before they are reused. The large item pool reduces the risk of compromised items impacting any given testlet. Another strategy would be to release all the items every year but leave them in the pool for reuse. If the pool were large enough, the likelihood that students or teachers would be advantaged on any one administration would be very low. And if, in fact, teachers used all of the known items and tasks for formative assessment on standards being addressed in curricula, the effects on student learning should be positive.
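The pool-size argument above can be illustrated with a little arithmetic. The following is a minimal sketch, not a procedure from this report: it assumes a test form is drawn uniformly at random from the pool, and the function names and the numbers in the usage note are hypothetical. Under that assumption, the number of compromised (previously exposed) items landing on a form follows a hypergeometric distribution.

```python
from math import comb


def expected_known_items(pool_size: int, exposed: int, form_size: int) -> float:
    """Mean number of exposed items on a form drawn at random from the pool.

    Assumption (illustrative only): every item is equally likely to be drawn.
    """
    return form_size * exposed / pool_size


def prob_at_least(pool_size: int, exposed: int, form_size: int, k: int) -> float:
    """Probability that at least k exposed items appear on the form.

    Sums the hypergeometric probabilities for k, k+1, ... exposed items.
    """
    total = comb(pool_size, form_size)
    p = 0.0
    for j in range(k, min(exposed, form_size) + 1):
        # Choose j exposed items and (form_size - j) unexposed items.
        p += comb(exposed, j) * comb(pool_size - exposed, form_size - j) / total
    return p
```

For example, with a hypothetical pool of 1,000 items of which 50 have been exposed, a 40-item form would contain two exposed items on average (`expected_known_items(1000, 50, 40)`); enlarging the pool while holding the other values fixed lowers that figure proportionally.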
Professional development. A coherent assessment system must include professional development on the administration, use of results, selection, adaptation, and creation of new tests. The professional development should be tailored for assessment staff, administrators, and teachers.
Instructional Utility. It is essential that teachers understand how to use the information from formative and summative assessments to diagnose student strengths and weaknesses and to shape instructional strategies. To do so, teachers must understand the characteristics of proficiency in relation to instructional tasks and how to provide students with opportunities to master the material.
Technology. Technologies will play key roles in multi-level science assessment systems. Technologies can support the logistics of an assessment system related to aligning assessments with standards, designing assessment tasks, housing collections, delivering assessments online, hosting online rater training and scoring, and tailoring assessment reports. More importantly, technology-based science assessments can present rich, complex tasks, permit dynamic, extended inquiry, log problem solving pathways, collect innovative kinds of student responses, and display them. Models described in this report incorporate strategies for phasing in technology supports.
Science Assessment System Model Components
There can be considerable variability in the ways that multi-level science assessment models implement system components. State collaboration may be minimal or extensive. The integration of assessment designs and results at different levels of the educational system may be advanced or in the initial stages. The development goals for task/item formats may be limited or varied. Plans for test assembly may be for common forms across states or for variable forms to be determined within a state. The use of technology supports may be in the initial stages, for a few assessment functions, or quite fully developed.
Below we describe two potential assessment system models with components that vary on these dimensions. The models are intended to serve as examples and to provide descriptions of how the models would be led, designed, and implemented. Other combinations are possible.
The second model, the Classroom Focused Multi-Level Assessment Model, also involves a state collaborative that produces pools of tasks in multiple formats: selected response, constructed response, and performance assessment. In this model, the description focuses on test design and psychometric issues related to using the shared pool to provide diagnostic student information for classroom teachers as the foremost priority. Methods are also elaborated for constructing test forms that can serve accountability purposes by incorporating common items at the classroom, district, and state levels. This model requires advanced technology for tailored, just-in-time, online test assembly and curriculum-embedded administration.
State Coalition for Assessment of Learning Environments Using Technology (SCALE Tech)
SCALE Tech is one possible model for enhancing state science assessment systems (see Figure 2). In the SCALE Tech model, states and districts would form a network committed to developing collections of science items and tasks that will elicit evidence of the full range of challenging science standards and to employing technology to support the development and use of such assessments throughout the levels of the educational system. State coalition members would collaborate on the design and development of science assessment item and task collections that can provide evidence of student progress on standards. States might set development priorities that target hard-to-measure inquiry standards such as designing investigations or communicating an evidence-based argument. For example, the coalition could elect to focus shared development on performance assessments that ask students to actually conduct investigations wherein they pose questions, design investigations, collect, analyze, interpret, and explain data and evidence to support conclusions, and communicate findings.
SCALE Tech Leadership. To develop a multi-level science assessment system, the coalition would establish a leadership team composed of science assessment directors from each state. Initially, the SCALE Tech model would involve “lighthouse districts” in states committed to testing challenging standards. These would be districts that are implementing reform curricula that promote deep science understanding and inquiry and that have the interest and infrastructure to pursue technology-enhanced assessment. Assessment coordinators from each district would serve on the leadership team. The leadership team would be responsible for creating coalition policies on such issues as funding, distribution of development responsibilities, procedures to ensure technical quality of the assessments, development priorities, policies for access to the assessment collections, and development and use of technology supports.
SCALE Tech Collection Development Strategies. Leaders of the state coalition would set assessment development priorities based on analyses of existing sets of items and tasks linked to national and state science standards. Plans would need to include estimation of the size of collections needed for assessing the science standards. Criteria for setting development priorities might include: 1) the standard is considered a fundamental, big idea, 2) the standard is not appropriately assessed by types of items currently available, 3) the standard is important, but few items or tasks are available to assess it. Estimates for the needed size of the collections would be based on intended uses of secure and released items in district and state accountability tests and for classroom assessments. An essential role of the leadership team would be to establish and monitor procedures for ensuring the technical quality of items and tasks developed for the secure and non-secure pools and for test forms prepared for district and/or state accountability testing.
Assignments for district team development could be based on interests or specific expertise. Therefore, a team might specialize in the development of items and tasks in life science or in elementary level assessments. The coalition leadership team would plan and monitor the distribution of the development effort.
Access Policies. The SCALE Tech leadership team would determine policies for access to the secure and non-secure collections. A portion of the items and tasks developed could be held in a secure pool to be drawn from for inclusion in the state science test. The districts could draw from the same secure pool or one designed for district accountability tests. An open pool would be available for use by classroom teachers, leaders of professional development programs, and curriculum programs. It is likely that the assembly of test forms for district or state accountability testing would fall within the purview of individual state assessment programs. Opening the non-secure pools to classroom teachers would allow individual teachers to build their own formative tests. To promote classroom use of the collections, it would also be possible for the coalition to sponsor alignments of items and tasks in the standards-based collections to widely used science curriculum programs.
Collection Maintenance. A critical factor for the effectiveness of SCALE Tech will be the ongoing development of the collections to replenish items and tasks in the secure pool that have been exposed in repeated accountability administrations and released to the open pool. After the SCALE Tech model has been in operation for a few years, the coalition could involve new districts, and, perhaps, add states. Plans could be for ongoing engagement of districts statewide and across states in developing and using science assessments. An alternative might be to rotate districts on a to-be-determined cycle.
The benefits of the SCALE Tech model would occur at all levels of the educational system. Teachers, students and parents would have access to evidence from sound assessments of challenging science to guide instruction and monitor progress. State and district accountability programs would have access to multiple measures of achievement on the full set of science standards. As the science assessment system takes advantage of the affordances of technology, the logistics and costs of including a greater range of assessment formats, particularly assessments of complex performance, throughout the system will be greatly reduced, and, hence, more feasible and instructionally relevant.
Classroom Focused Multi-Level Assessment Model
• Are aligned with the content standards.
• Are organized in instructionally meaningful units.
• Can be delivered on demand.
• Provide timely diagnostic information for teachers and students.
• Allow measures to be combined over time and across levels.
• Support instructionally meaningful reporting.
• Contribute to valid and reliable measures for accountability.
The Classroom-focused model is designed to be implemented at a state level. States specify the blueprint of the assessment, providing alignment between the assessment and their content standards, objectives, indicators, and assessment limits. School districts map their curricula to the blueprint and create assessment modules to fit the scope and sequence of instruction. Alignment of the components – standards, assessments, district curricula, and taught curricula – is accomplished at the state, district, and school levels through this process. The state blueprint is used by the test creation software to ensure appropriate coverage. School districts and schools can augment the coverage with additional assessment items to address their specific instructional goals. The core data for the state are derived from summative items drawn from the state item pool. Formative items are used by teachers, schools, and districts to inform instruction. For accountability purposes, states define performance standards based on the state assessment blueprint. Student data are collected on these items and aggregated across modules to provide summative scores at the student level. To help ensure accuracy, we recommend an end-of-year formal assessment consistent with the blueprint for all students. The formal summative assessment would be a short test consisting of a mix of item types, primarily selected response items. The formal summative assessment can be linked to the aggregated summative items from each curriculum module as well as the formative items to establish the relationships among the various assessments.
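The role of the blueprint in test creation software can be sketched in a few lines. This is only an illustration under stated assumptions, not the software the model envisions: the `Item` fields, the `assemble_form` function, and the representation of a blueprint as a mapping from a content standard to a required count of summative items are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    standard: str  # content standard the item is aligned to
    purpose: str   # "summative" or "formative"


def assemble_form(pool, blueprint, seed=0):
    """Draw, for each standard, the number of summative items the blueprint requires.

    pool      -- iterable of Item (hypothetical stand-in for the state item pool)
    blueprint -- dict mapping standard -> required number of summative items
    Raises ValueError if the pool cannot satisfy the blueprint.
    """
    rng = random.Random(seed)  # seeded so a form can be reproduced
    form = []
    for standard, needed in blueprint.items():
        candidates = [
            item for item in pool
            if item.standard == standard and item.purpose == "summative"
        ]
        if len(candidates) < needed:
            raise ValueError(f"pool has too few summative items for {standard}")
        form.extend(rng.sample(candidates, needed))
    return form
```

A district module could then append formative items of its own choosing to the returned list; only the summative items drawn against the blueprint would feed the aggregated accountability score.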
Although item and task development would be shared enterprises, separate calibrations would be necessary for states in the collaborative to account for state-specific content standards and/or assessment blueprints. Formative items should be simultaneously calibrated and thereby linked to the summative items. However, the value of the formative items lies in their diagnostic utility, which must be developed through a judgmental process among teachers and content and cognitive experts. The development of diagnostic protocols would be guided by the collaborative but would be accomplished within states, where the process could also serve as a teacher professional development (TPD) opportunity.
This model requires a substantial technological base. Item and task banks must be searchable with appropriate alignment tools. Modular tests must be created, delivered, and scored on-line for immediate feedback. The data must be managed at the item, student, and classroom level. While a vast majority of the software components are currently available, they will need to be adapted to fit the model requirements. In addition, the bandwidth demands of computer delivered assessments have implications for the communications technologies currently available in most schools. The technology problems are readily solved but do require a commitment of resources by participants.
In order to sustain the system, new formative and summative items would need to be developed, pilot tested and field tested on a continuing basis. The collaborative would continue to develop items and states could automatically include items for field testing in the modular assessments to obtain item parameters. New items would need to be calibrated with respect to the baseline year to ensure accurate measures of growth.
Development of Multi-Level Science Assessment Systems
At the heart of a multi-level science assessment system is a large pool of items and tasks that is readily available for use by classroom teachers, local assessment designers, and state assessment designers for their specific information needs. The pool must include both diagnostic (formative) and evaluative (summative) items. Formative items are designed to be used by teachers to yield student level information about strengths and needs with respect to specific content knowledge and cognitive skills. Formative items are intended to be used individually or combined into instructionally coherent sets to fit individual classroom circumstances. Individually, formative items must elicit useful diagnostic information and be supported by tools to help teachers interpret student responses in instructionally relevant ways. The ultimate measure of quality of formative items is determined by a cognitive analysis of student responses they elicit.
Summative items are designed to be included in more traditional test formats to provide decision makers with information about student achievement levels. The items are designed to be components of tests or sub tests and are not intended to stand on their own. Combinations of summative items for evaluative purposes must conform to the AERA/APA/NCME standards for assessments to help ensure accurate decisions. The ultimate measure of quality of a summative item is its psychometric properties, determined in the context of the test or sub test in which it is embedded.
Building and replenishing the item pool requires a substantial commitment from the collaborative members. The commitment begins with an agreement on content standards, objectives, indicators, and assessment limits. This is a prerequisite for entering into the collaborative. Since most states’ content standards are consistent with those developed by the NSES, we assume that agreement can be reached among willing participants. Developing item pools is a multi-step process, described below. Item development is a continuous process central to the collaborative. We envision that each state will engage teachers and content experts in an institutionalized process for contributing items to the pool (see Figure 4). Newly developed or revised items would be systematically included in operational assessments as an integral part of the test construction process. Below we describe key steps in the assessment development process.

Identify priority standards and development needs for the collaborative: The first challenge for a collaborative is arriving at a consensus on standards for which a joint development effort should be mounted. We propose that states would examine the NSES in relation to their state standards and identify the commonalities at each grade level or grade range. States would then deliberate on the criteria for selecting development priorities and a plan for phasing in priority areas if they cannot all be addressed at once. The criteria might involve development for a few central conceptual and/or inquiry areas, for subject matter areas for which there are gaps or thin coverage in currently available items, or for formats such as extended constructed response or performance assessments that are not currently available. The greater the overlap among state standards, the greater the benefit of a collaborative effort. The present discussion will focus on the content described in the NSES and will be limited to these core standards. In our illustrations and examples we will focus on inquiry (content standard A) and life science (content standard C) for the grade band 5 through 8 (see Boxes 2 & 3). The process we propose is applicable to any set of standards proposed by states or school systems.
Box 2: Inquiry 
Box 3: Life Science


Item specification for item development: The proposed multi-level models for science assessment call for an assessment system that takes into account the summative and formative purposes described above. The item specification process should start at the instructional level by unpacking the content standard indicators into instructionally meaningful components where the knowledge and/or skills and cognitive process dimensions of the content standard indicators are unambiguously defined. This is a judgmental process whereby content and cognitive experts, along with teachers, examine each indicator and achieve consensus on what knowledge and/or skill is called for and at what cognitive level students are expected to demonstrate they have acquired it, essentially defining what constitutes proficiency for the indicator. For many content standard indicators there is an additional consideration of what can be fairly assessed, termed assessment limits. Assessment limits represent the ‘at-least-list’ for instruction and the ‘at-most-list’ for assessment in order to balance the competing demands for fairness in specifying what is to be tested and the unintended consequence these specifications may have on narrowing what is taught (Schafer & Moody, 2003). For example, the indicator may call for students to design and conduct a scientific investigation. Given the range of possibilities from simple to complex designs, it is reasonable to delimit an assessment item by specifying the number of independent and dependent variables. As a final step in the specification process, it should be noted whether the indicator can be assessed through items based on selected response (SR), constructed response (CR), or performance assessment (PA) modes. (See Boxes 4 and 5 for examples of the item types.)
Thus, an item is described by the intended knowledge or skill of the indicator, the cognitive process(es) required for students to demonstrate proficiency, the limits on what is considered fair game for testing, and the preferred student response mode.
Box 4: Example of Constructed Response Item

Box 5: Examples of Selected Response, Constructed Response, and Performance Assessment Items
Table 1 is an illustration of an item descriptor table for the inquiry content standard. The nomenclature distinguishing among levels of specificity, from more general to more specific, is: Domain, Standard, Objective, Indicator, Assessment Limit. The item descriptor tables provide the specifications for item writers at the assessment limit level for each content standard indicator. The cognitive process dimension is an attempt to specify the depth of understanding required of students to successfully respond to the item. Since state assessment programs may employ different schemes for distinguishing the dimensions of conceptual complexity and inquiry, members of the collaborative will need to agree upon a common classification scheme to be used in test design. Table 1 uses the updated version of the scheme originally proposed by Bloom (Anderson & Krathwohl, 2001). Different schemes may be selected by the state collaborative (Quellmalz, 1987).
Table 1. Sample Item Descriptor Table
Domain: Inquiry
Standard: As a result of activities in grades 5-8, all students should develop abilities necessary to do scientific inquiry and understandings about scientific inquiry.
Objective: As a result of activities in grades 5-8, all students should develop abilities necessary to do scientific inquiry.
Indicator: Design and conduct a scientific investigation
Assessment Limits: Two independent variables
Item descriptor matrix (rows: item type; columns: cognitive process)
| Item Type | Remember | Understand | Apply | Analyze | Evaluate | Create |
| SR | | | | | | |
| CR | | | | | | |
| PA | | | | | | |
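One way to picture a single cell of the item descriptor table is as a small record that item writers fill in. The sketch below is illustrative only: the class and field names are hypothetical, and the choice of "Create" as the cognitive process for this indicator is an assumption made for the example, not a classification given in the table.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseMode(Enum):
    SR = "selected response"
    CR = "constructed response"
    PA = "performance assessment"


# Column headings of the item descriptor matrix.
COGNITIVE_PROCESSES = (
    "Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create",
)


@dataclass(frozen=True)
class ItemDescriptor:
    """Hypothetical record for one cell of an item descriptor table."""
    indicator: str
    cognitive_process: str
    assessment_limit: str
    response_mode: ResponseMode

    def __post_init__(self):
        # Reject cognitive process labels outside the agreed scheme.
        if self.cognitive_process not in COGNITIVE_PROCESSES:
            raise ValueError(
                f"unknown cognitive process: {self.cognitive_process}"
            )


# Illustrative descriptor; the cognitive process level is an assumption.
descriptor = ItemDescriptor(
    indicator="Design and conduct a scientific investigation",
    cognitive_process="Create",
    assessment_limit="Two independent variables",
    response_mode=ResponseMode.PA,
)
```

Validating the cognitive process against a fixed list reflects the point that collaborative members must agree on one classification scheme before item writing begins.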
Development/pilot testing process: Item development must conform to the AERA/APA/NCME standards for test development. Below we address issues unique to the multi-level models.
Task/Item specification shell. Item writing begins by providing the item writers with a thorough understanding of the NSES content standards for discipline areas and inquiry and the depth of understanding expected of students. The process can only proceed after content and cognitive experts have agreed on the item descriptor table specifications. Using the specifications, item writers populate the tables with proposed items.
Review for content quality and sensitivity. Proposed items are subjected to multiple reviews to ensure that they conform to the specifications, are accurate with respect to the content knowledge, are grammatically correct, are fair for all students, and can be consistently administered in a variety of classroom settings. Committees of teachers and content experts with broad representation from special populations (students with disabilities and limited English proficiency) as well as gender and race/ethnicity conduct the reviews. Content and cognitive experts should also review formative items for their potential diagnostic value.
Pilot testing. All items require pilot testing on representative samples of students. The purpose of pilot testing is to obtain preliminary item performance information prior to formal field testing. At this stage, the items need not be organized into tests or sub tests and may be piloted individually with no fewer than 100 students. For CR and PA items, the efficacy and accuracy of scoring rubrics are evaluated for the purpose of refining the rubrics, the items, or both. Pilot testing of representative formative items should include a cognitive analysis of the responses of no fewer than 10 students representing a range of performance levels. The cognitive analysis provides the basis for developing diagnostic protocols for interpreting responses and for establishing one form of construct validity.
Once items and tasks are pilot tested and approved for use, they go into the item pool. Formative items are then ready for use in the classroom, since the value of these items depends on their ability to elicit diagnostic information and not on their psychometric characteristics. The formative item pool would be considered open.
Summative items would be available for further field testing by states. The items would be considered secure. States could draw from the secure pool to assemble field test forms. It would be the states’ responsibility to formally field test the items in the context of their assessment blueprints. States would establish the item parameters using their preferred procedures and use the items according to their assessment specification procedures. The collaborative might request that each state provide its parameters for the item to the item bank.
Creating Test Forms: Assuring the quality of an assessment depends on the alignment of the items with the content and cognitive domains expressed by the content standards. Test forms are created by selecting items from the pool according to a test blueprint or test map. The blueprint specifies the number and types of items at the objective or indicator level. The item distributions determine the relative contribution of sub scores to the total test score. To ensure coverage of the content standards, the test blueprint specifies the number of items, item types, and a sampling strategy for selecting items at the indicator level for test forms. State content standards are likely to differ in the number of standards, their emphasis, and the depth of understanding expected of students. As a consequence, states will have to develop unique test blueprints. States would be expected to follow the AERA/APA/NCME standards for test form creation.
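Selecting items from the pool according to a blueprint can be sketched as follows. This is a minimal illustration under stated assumptions: the blueprint is reduced to a required item count per indicator and item type, the sampling strategy is simple random sampling within each blueprint cell, and the item identifiers, field names, and the function `assemble_form` are hypothetical.

```python
import random
from collections import defaultdict


def assemble_form(pool, blueprint, seed=0):
    """Draw items from the pool to satisfy a test blueprint.

    pool: list of dicts with keys "id", "indicator", "item_type".
    blueprint: dict mapping (indicator, item_type) -> number of items.
    Sampling is simple random selection within each blueprint cell,
    which is an assumption; a state could specify another strategy.
    """
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for item in pool:
        by_cell[(item["indicator"], item["item_type"])].append(item)

    form = []
    for cell, count in blueprint.items():
        candidates = by_cell.get(cell, [])
        if len(candidates) < count:
            raise ValueError(
                f"pool has {len(candidates)} items for {cell}, "
                f"blueprint needs {count}"
            )
        form.extend(rng.sample(candidates, count))
    return form


# Hypothetical pool and blueprint for one inquiry indicator.
pool = [
    {"id": "INQ-SR-1", "indicator": "design investigation", "item_type": "SR"},
    {"id": "INQ-SR-2", "indicator": "design investigation", "item_type": "SR"},
    {"id": "INQ-SR-3", "indicator": "design investigation", "item_type": "SR"},
    {"id": "INQ-CR-1", "indicator": "design investigation", "item_type": "CR"},
    {"id": "INQ-CR-2", "indicator": "design investigation", "item_type": "CR"},
]
blueprint = {
    ("design investigation", "SR"): 2,
    ("design investigation", "CR"): 1,
}
form = assemble_form(pool, blueprint)
```

Raising an error when a cell cannot be filled makes gaps in the pool visible, which is the kind of information a collaborative would need when setting development priorities.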
Assessment system blueprint/combining measures at each level: For the multi-level articulated models proposed here, items are further classified as summative or formative. Summative items contribute to accountability scores while formative items are designed to support instruction. The commonalities across levels from state to local to classroom are based on items developed according to common specifications related to core components of the content standards that are present at each level. The differences are based on the unique components at each level.
Blueprint development is an iterative process where test developers balance test length and desired reporting level (content, domain, objective, or indicator) based on the acceptable level of accuracy of the results for student level reporting. By consensus members of the collaborative would decide details of the core blueprint: test length, reporting level, item counts at the objective and indicator level, weighting strategies, and item sampling rules.
Issues in setting performance standards: Setting performance standards is a state responsibility under NCLB. The performance standards require that student performance on assessments aligned with state content standards be categorized into at least three levels: basic, proficient, and advanced. The percentage of students performing at the proficient and advanced levels is the primary academic measure for sub groups, schools, school systems, and the state. Setting performance standards is a judgmental process relying on the expertise of educators. While there are a number of acceptable procedures, the fundamental concept is to set cut points on the assessment scale representing qualitatively distinct groups of students in terms of their mastery of the state’s content standards. The fact that states are likely to have different test blueprints means that they will have different scales for student scores. The cut points defining the performance categories set on these scales are unique to states. Furthermore, state definitions of the knowledge and depth of understanding that represent the performance categories are also likely to differ. Thus, performance standards and percentages of students in the classifications are not comparable across states.
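The dependence of reported percentages on state-specific cut points can be illustrated with a short sketch. All numbers here are hypothetical, and the same set of scale scores is classified under two invented sets of cut points purely to show the mechanics; in practice each state would also have its own scale.

```python
from bisect import bisect_right
from collections import Counter

LEVELS = ("basic", "proficient", "advanced")


def classify(score, cuts):
    """Map a scale score to a performance level.

    cuts: (proficient_cut, advanced_cut) in ascending order.
    A score at or above a cut point reaches that level.
    """
    return LEVELS[bisect_right(cuts, score)]


def percent_proficient_or_above(scores, cuts):
    """Percentage of students classified as proficient or advanced."""
    counts = Counter(classify(s, cuts) for s in scores)
    return 100.0 * (counts["proficient"] + counts["advanced"]) / len(scores)


# Hypothetical scale scores and two hypothetical sets of cut points.
scores = [380, 400, 420, 450, 470]
state_a_cuts = (400, 450)
state_b_cuts = (430, 460)

pct_a = percent_proficient_or_above(scores, state_a_cuts)  # 80.0
pct_b = percent_proficient_or_above(scores, state_b_cuts)  # 40.0
```

The same performances yield different percentages under the two sets of cut points, which is why the classifications cannot be compared across states.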
Collaboration Issues
The most successful collaborative model is one in which the states agree to use the same content standards. This is ideal but is unlikely to be the reality for most states. At a general level, the states may be able to relate their science standards to the NSES and agree on large NSES categories for priority development. Agreements will need to be forged at the task specification level. Each state will need to sign off on item and task specifications that will define the features of sets of parallel items and tasks to be developed according to the specifications. Our view is that the goal is not a single assessment shared across states but an item pool from which unique assessments, both formative and summative, can be constructed by each state to meet its specific needs and standards. States would have the flexibility to create customized assessments drawing on the collaborative item pool and, if necessary, augmenting their assessments with state-unique items. In this model, each state would assume the responsibility for the psychometrics of its own test, setting performance standards, and data reporting. States could submit the item statistics to the collection for reference by other states. The benefits of participating in a collaborative would be access to a large item pool, expert advisory committees, and the shared experience of others. Clearly, the benefits will be directly related to the number of commonly held content standards.
Common issues affecting success. Regardless of the magnitude of effort required, to be successful a collaborative must address a number of common issues. These include:
• Governance.
• Day-to-day management of activities.
• Consensus on the scope and duration of the effort.
• Consensus on the standards, framework, or objectives for assessments and component skills.
• The formats of tasks and items to be developed and the particular specification shells to guide the development.
• Policies for use of the products.
• Policies for secure and public access to products.
Risks to the collaborative success. Collaborative efforts unravel for a variety of reasons: some can be addressed, some cannot. For example, changes in the political climate are unpredictable and unavoidable. NCLB is a prime example. Multiple agendas and expectations of state leadership are difficult to predict and address. For example, shifts in state leadership often result in new state priorities and focus. Changes in economic conditions impact the capacities of member states to maintain commitments. These issues are unavoidable.
Issues that can be managed for collaborative success: Baseline elements for a successful Collaborative Organization are: a clear mission that meets the strategic and policy requirements of the participating states, realistic expectations of what is to be accomplished, and a governing board with decision-making authority from the participating states. These elements provide a framework for successful development and ongoing operation of the organization. Other issues that can be managed for collaborative success are:
• Governance
• Operational logistics
• Conflict resolution
• Continuous improvement
• Quality assurance of the products
• Timely delivery.
These are organizational dynamics that can be addressed by establishing appropriate structures and procedures.
Essential Elements of the Collaborative Organization
Governing Board. The success of the collaborative depends on a governing board consisting of state representatives with decision-making authority, the ability to commit state resources, and the political connections to directly influence state policy. The governing board would be responsible for developing policy with respect to the operation of the collaborative. The governing board would also be responsible for authorizing the scope of the collaborative mission and ensuring the financial resources for the viability of the Collaborative Organization. The Governing Board would work through a Collaborative Organization headed by the Collaborative Director.
Collaborative Director. The Collaborative Director is the Chief Executive Officer of the collaborative project, operating with the guidance of the Governing Board. The director would have authority to make operational subject matter decisions without Board approval. The director would have responsibility for day-to-day operations and the accomplishment of key operational requirements:
• The establishment and maintenance of expert advisory teams in relevant areas of expertise
• Ensuring the involvement of teachers in all aspects of product development
• Coordination with State Project Leads.
The Collaborative Director would also have responsibility for managing the budget and advising the board of financing requirements.
Expert Advisory Teams. Expert advisory teams provide the expert advice and guidance to the project that ensures that the products meet agreed upon standards. These teams would include nationally recognized experts who help design, monitor and implement the project. The actual work could be done through contractors or by working teams of experts and teachers contributed by the states.
Teacher Involvement. A key element of the successful Collaborative Organization is teacher involvement in all aspects of product development. The goals of teacher involvement are to maximize the professional development benefit afforded by collaboration with experts and to build capacity within the states. Teachers, content supervisors, TPD coordinators, technology coordinators, and assessment directors from the states must have the time and resources to participate.
State Project Leads. The role of the State Project Lead is critical to the successful participation of a state in a collaborative project. The State Project Lead is essential to communicating the goals and requirements of the state to the Collaborative Organization and communicating the progress and results of the Collaborative Organization back to the state. The State Project Lead is also essential to ensuring the participation of teacher subject matter experts on the advisory teams.
The organization and relationship of these key components of the Collaborative Organization are presented in Figure 5.

Use of Technology Supports
Currently available technologies can dramatically alter what, where, when, and how students are tested. For accountability testing, states are looking to technology to mitigate the logistics and costs of test administration and scoring by automating test delivery and reporting of results. More than a dozen states are actively exploring the transition to technology-based assessment, and others are in the planning process (Russell, Goldberg & O’Conner, 2003). To date, online testing programs are focusing on computer delivery of their conventional tests, typically composed of multiple-choice and some constructed response questions. Studies are in progress to compare the validity and reliability of the paper and computer modalities and to probe issues such as student experience learning with computers and item layout and presentation on the screen (Russell et al., 2003). Technologies can further support assessment functions by supporting designs of the assessment system, test authoring, item development and review, and automated scoring and reporting.
Technologies transforming the work of scientists are making their way into school curricula. Technologies such as calculators, probeware, data analysis tools, and modeling and visualization software are becoming more prevalent. Simulations will allow presentation of problems that are impractical in typical classrooms. In time, science assessments will be expected to include complex tasks in which students use such tools.
This report describes strategies and examples states may use to capitalize on a wide array of technologies to support the development of a multi-level state science assessment system. In the sections below, we provide examples of technology supports that are particularly useful. Potential technology supports are linked to elements of the assessment system design and use: 1) alignment of assessments with science standards and curricula, 2) design, access, and use of online collections/pools of items and tasks, 3) administration at multiple levels, 4) automated and rater-based scoring, 5) professional development, and 6) tailored reporting and interpretation of results. In addition, this section describes technologies that can support ongoing collaboration, communication, and dissemination. The section ends with a brief summary of strategies states may employ to phase in technology-based science assessments.
Purposes of technology-supported science assessments. The intended uses of technology-supported science assessments may expand or limit the types and grain size of student knowledge and skills to be assessed, the concomitant forms of evidence of learning to be gathered, and the designs of tasks to elicit observable evidence of science learning. In the multi-level models described in this report, collections of standards-based items and tasks developed according to common specifications are the foundation for linking test forms at classroom, district, and state levels. As illustrated in the conceptual framework in Figure 1, classroom formative assessments would draw more items from the collections to monitor student learning on science standards addressed in multiple curriculum units. These classroom assessments might incorporate more constructed response items and performance assessments than would be assembled in district science assessments. State science tests would be likely to draw fewer performance assessments and constructed response items from the collection for accountability testing. Thus, classroom assessments would draw more items, and more formats that yield information about deeper conceptual understanding and inquiry, than would state test forms.
Technology supports for aligning assessments with science standards and curricula. Alignment tools vary according to the intended purpose, user, and use of the alignment information. Box 6 presents an alignment tool under development by Norman Webb at the Wisconsin Center for Educational Research (WCER) designed to produce reports on the alignment of curriculum standards and student assessments. The tool is to be used to establish and document the alignment of a designated set of assessments with a set of standards. Groups of reviewers receive online training in the alignment process, then they enter their alignment judgments. States, districts, and curriculum programs could use this alignment tool to align assessments in the state and collaborative item and task banks with state, district, and national science standards.

Box 6: Example of an Alignment Tool

Another function of technology-based alignment tools is to support searches of online databases or collections for assessment resources. In these tools, the alignments of assessments to standards have been previously established. The alignments have been incorporated into a relational database to enable searches for assessments intended to test one or more standards. Box 7 presents a page from the Web site of Performance Assessment Links in Science (PALS), funded by the National Science Foundation (Quellmalz, 2003). The PALS online collection offers over 300 K-12 standards-based science investigations. Science tasks in the collection are linked to national science and mathematics standards (NSES and NCTM) and to a sample of state science standards. In addition, the science assessment tasks are linked to modules within two sample science curricula, FOSS and STC. State science collaboratives could create such relational databases to allow access to standards-based items in the task and item banks. For example, the common collection could be searched via the NSES or the science standards of each state in the collaborative. The latter search would reinforce, for assessment program staff as well as teachers, awareness of the linkages between standards and assessments.








The PALS Web page in Box 8 illustrates the procedures for searching for tasks by the NSES inquiry standards.




Box 8: Example of Searching for Tasks by NSES Standards
Box 9 presents search results for tasks linked to middle school standards for populations and ecosystems.
Box 9: Example of Result of Search for Tasks Linked to Standards

Box 10 presents an assessment chart from PALS that displays the specific science standards that tasks from the search are intended to test. For example, the performance task on Pond Water Populations is linked to four NSES standards: 1) designing and conducting investigations, 2) developing descriptions and explanations using evidence, 3) using mathematics, and 4) the relationship of ecosystem features to number of organisms. The Predator-Prey performance assessment is linked to only two NSES standards: 1) using appropriate tools and 2) using mathematics. Teachers or assessment staff searching for performance assessments aligned to particular science standards could then click on the title of the tasks and examine them. The assessment planning chart helps users see the combination of tasks that would cover the science standards specified in the search. A collection composed of selected response, constructed response, and performance assessments could produce a similar chart displaying the coverage of each standard by the candidate items and tasks and by those finally chosen. A variant of such a chart could be a test blueprint that would be gradually completed during a search and selection process.


Box 10: Example of Assessment Chart of Science Standards
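The logic of such an assessment planning chart can be sketched in a few lines. The task-to-standard links below are taken from the two tasks described above, with the standard labels paraphrased; the additional target standard used to show a coverage gap, and the function names, are hypothetical.

```python
# Task-to-standard links as described for the two PALS tasks above.
TASK_STANDARDS = {
    "Pond Water Populations": {
        "designing and conducting investigations",
        "developing descriptions and explanations using evidence",
        "using mathematics",
        "relationship of ecosystem features to number of organisms",
    },
    "Predator-Prey": {
        "using appropriate tools",
        "using mathematics",
    },
}


def coverage_chart(task_standards, target_standards):
    """Return {standard: sorted list of tasks linked to it}."""
    return {
        std: sorted(
            task for task, stds in task_standards.items() if std in stds
        )
        for std in target_standards
    }


def uncovered(chart):
    """List the target standards that no selected task addresses."""
    return [std for std, tasks in chart.items() if not tasks]


targets = [
    "using mathematics",
    "using appropriate tools",
    "communicating procedures and explanations",  # hypothetical gap
]
chart = coverage_chart(TASK_STANDARDS, targets)
gaps = uncovered(chart)
```

Listing the uncovered standards is what would let a chart of this kind double as a blueprint that is gradually completed as tasks are selected.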
Box 11 presents a Web page from a project funded by NSF to develop classroom assessment tools for the science program Global Learning and Observations to Benefit the Environment (GLOBE). This online assessment collection aligns the sample classroom science investigation assessments to the NSES, AAAS Benchmarks, TIMSS, and the New Standards Science Reference Exam.
State science collaboratives could use the same alignment protocols to guide alignments of the collaborative task and item pools, as well as science curricula, with national and state science standards. These alignments could then be placed in a relational database to allow searches of the collaborative pools by standards and curriculum units.

Box 12: Example of GLOBE Assessment Framework Aligned to National, State and Other Standards
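A relational database of the kind described above, in which previously established alignments allow a collection to be searched by standard, can be sketched with an in-memory SQLite database. The schema, the table and column names, and the "State X" standard are hypothetical; this is not the PALS or GLOBE implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE standard (
    id INTEGER PRIMARY KEY,
    framework TEXT NOT NULL,   -- e.g. NSES or a state's own standards
    label TEXT NOT NULL
);
CREATE TABLE alignment (
    task_id INTEGER REFERENCES task(id),
    standard_id INTEGER REFERENCES standard(id),
    PRIMARY KEY (task_id, standard_id)
);
""")
conn.executemany(
    "INSERT INTO task VALUES (?, ?)",
    [(1, "Pond Water Populations"), (2, "Predator-Prey")],
)
conn.executemany(
    "INSERT INTO standard VALUES (?, ?, ?)",
    [
        (1, "NSES", "using mathematics"),
        (2, "NSES", "using appropriate tools"),
        (3, "State X", "populations and ecosystems"),  # hypothetical
    ],
)
# Alignment judgments entered by reviewers (illustrative values).
conn.executemany(
    "INSERT INTO alignment VALUES (?, ?)",
    [(1, 1), (2, 1), (2, 2), (1, 3)],
)


def tasks_for_standard(conn, framework, label):
    """Return titles of tasks aligned to one standard in one framework."""
    rows = conn.execute(
        """SELECT t.title FROM task t
           JOIN alignment a ON a.task_id = t.id
           JOIN standard s ON s.id = a.standard_id
           WHERE s.framework = ? AND s.label = ?
           ORDER BY t.title""",
        (framework, label),
    ).fetchall()
    return [title for (title,) in rows]


nses_math = tasks_for_standard(conn, "NSES", "using mathematics")
state_x = tasks_for_standard(conn, "State X", "populations and ecosystems")
```

Storing the framework alongside each standard is what allows the same collection to be searched either by the NSES or by an individual state's standards.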
Technology-based science assessment collections. Vertical articulation of the multi-level science assessment models depends on pools of items and tasks that can be used for classroom, district, and state assessments and that have been developed according to common specifications. Formation of a collaborative and use of technologies for designing, using, and replenishing the assessment collections are essential supports for multi-level models. The models presented in this report rely on digital libraries of science items and tasks. Technologies can not only permit storage, access, and retrieval of standards-based science assessments, but can also support the design and development of the collection.
Technology-supported assessment design. The collaborative will need to create specifications for the kinds of items and tasks to be placed in the pools. Specification shells or templates would be prepared, for example, for the types of ecosystem items presented in the assessment development section. Specification shells could guide development of items parallel to the multiple-choice pond water item, the predator-prey constructed response item, and the pond ecosystem performance assessments. These templates, along with guidelines for using them to create parallel items and tasks, could be shared electronically. Drafts of items and tasks could be reviewed by distributed experts and revisions posted to the collaborative Web site. North Carolina, for example, is beginning to use an online item authoring system (Bazemore, 2004). The GLOBE student assessment project, described above, provides an online assessment template and sample tasks so that teachers can construct assessments tailored to their classrooms. The Principled Assessment Designs for Inquiry (PADI) project is developing a conceptual framework for assessing inquiry and an online library of design patterns, templates, and sample assessments (Mislevy, 2003). The PALS Web site (http://pals.sri.com) includes a PALS Guide to assist in the customization of science performance assessment tasks in the collection.
Science assessments will also be transitioning toward the use of technologies to assess more complex science content and inquiry (Quellmalz & Haertel, in press). Technologies will permit presentation of task environments and investigations that would not be practicable in classroom settings. Simulations will both translate familiar experiments (e.g., electrical circuits, solutions) into computer-based formats and permit assessment of inquiry skills as students use analysis, modeling, and visualization tools. Moreover, log files of online investigations will support analysis of dynamic problem solving and strategies. As technology-based activities pervade science curricula, assessment tasks using the same tools can be designed and become part of technology-based science assessments.
Technologies can also facilitate collaboration related to the assessment development process. Secure Web sites can permit posting and discussion of draft items and tasks, rubrics, and student work. A virtual professional development environment, TAPPED IN, allows geographically distributed participants to interact in synchronous and asynchronous modes. TAPPED IN hosts many professional development organizations and is the communication venue for the NSF Centers for Learning and Teaching (Schlager, Fusco, & Schank, 1998).
Use of assessment collections. Pools of science items and tasks can be stored in digital libraries. Secure and non-secure pools can be accessed through different entry requirements. Box 13 presents the Maine Assessment Portfolio Web site which houses collections of standards-based task sets. Box 14 describes the task bank for science and technology. Ten tasks and rubrics are provided for each standard.


The Council of Chief State School Officers (CCSSO) State Collaborative on Assessment and Student Standards (SCASS) science project has developed a collection of selected-response, constructed-response, hands-on performance event, and instructionally embedded performance tasks. Box 15 describes the collection, which is on disk and can be searched by attributes, words, and item codes.


Box 17: PALS Standards Based Bank of NSES Indexed Assessment Tasks

Box 18: PALM Standards Based Bank of NCTM Indexed Assessment Tasks
The GLOBE Classroom Assessment Tools Web site presented in Box 19 offers a collection of over 80 performance tasks aligned with earth science and inquiry standards and is linked to the PALS Web site (http://globeassment.sri.com).


Box 19: GLOBE Classroom Assessment Tools
The multi-level science assessment models could both build their own searchable online pools and provide links to other digital collections.
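To make the idea of a searchable pool concrete, the following is a minimal sketch, not a description of any existing system. It assumes a hypothetical `Item` record with an item code, item type, standard, and text, plus a `secure` flag standing in for the different entry requirements of secure and non-secure pools. The sample items and codes are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one item or task in a shared pool."""
    code: str        # item code, e.g. "ECO-001" (invented for this sketch)
    item_type: str   # "multiple-choice", "constructed-response", "performance"
    standard: str    # standard the item is aligned with
    text: str        # prompt text, searched by keyword
    secure: bool = False  # secure items require an explicit entry permission


def search(pool, *, standard=None, item_type=None, keyword=None,
           include_secure=False):
    """Return items matching every supplied attribute.

    Secure items are returned only when include_secure is True, which
    stands in for a real authentication and authorization step.
    """
    results = []
    for item in pool:
        if item.secure and not include_secure:
            continue
        if standard is not None and item.standard != standard:
            continue
        if item_type is not None and item.item_type != item_type:
            continue
        if keyword is not None and keyword.lower() not in item.text.lower():
            continue
        results.append(item)
    return results


# Illustrative pool; the items are placeholders, not real assessment content.
POOL = [
    Item("ECO-001", "multiple-choice", "ecosystems",
         "Which organism in the pond water sample is a producer?"),
    Item("ECO-002", "constructed-response", "ecosystems",
         "Explain how a drop in the predator population affects the prey."),
    Item("ECO-003", "performance", "ecosystems",
         "Investigate the pond ecosystem and report your findings.",
         secure=True),
    Item("INQ-001", "performance", "inquiry",
         "Design an investigation of electrical circuits."),
]
```

A classroom teacher's query such as `search(POOL, standard="ecosystems")` would see only the non-secure items, while a state test developer passing `include_secure=True` would also see the secure performance task.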
Computer-based assessment delivery and scoring. More and more states are converting their conventional testing to online delivery and scoring. The primary motives are efficiency and cost savings. The 2003 Technology Counts reported that thirteen states (including the District of Columbia) were administering computer-based tests. Oregon pioneered online delivery in its Technology-Enhanced Student Assessment (TESA) project. The costs of developing the online system were offset within a year by savings. Furthermore, studies of the print and online testing results found remarkably comparable item statistics (Neuberger, 2004).
Box 19: Oregon TESA Project


In such computer-delivered projects, multiple-choice questions are being scored automatically, while constructed response questions are scored by raters. Studies are underway to address validity and reliability issues. It has been noted that standards-based testing may be less challenged by cross-modality comparability than by documentation of appropriate opportunities to learn with computers (Russell, Goldberg & O’Connor, 2003).
Technologies can substantially ease the logistics and costs associated with scoring open-ended responses. Automated essay scoring of responses over 100 words is becoming increasingly comparable to human scores (Shermis, 2004). The shorter constructed responses, however, still require human scorers. Online rater training and scoring can permit geographically distributed individuals or groups to learn to apply rubrics reliably (Quellmalz & Schank, 1998). Training and score entry can be online, and student responses may be digitized or on paper. A number of test publishers employ technology-supported, distributed scoring.
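One common way to check whether distributed raters are applying a rubric reliably is to compare two raters' scores on the same set of responses. The sketch below computes exact agreement and Cohen's kappa, which corrects agreement for chance. It is offered as an illustration under our own assumptions; the report does not specify which agreement statistics an online scoring system should use, and the function names and sample scores are hypothetical.

```python
from collections import Counter


def exact_agreement(rater_a, rater_b):
    """Proportion of responses on which two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of responses")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = exact_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same score point
    # if each assigned scores independently at their own observed rates.
    chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if chance == 1.0:
        return 1.0  # both raters used a single identical score point
    return (observed - chance) / (1 - chance)


# Hypothetical rubric scores (0-3) from two raters on eight responses.
scores_a = [0, 1, 2, 2, 3, 1, 0, 2]
scores_b = [0, 1, 2, 1, 3, 1, 0, 3]
agreement = exact_agreement(scores_a, scores_b)  # 6 of 8 responses match
kappa = cohens_kappa(scores_a, scores_b)
```

In a rater training setting, a trainee's scores on pre-scored anchor papers could be compared with the expert scores in this way before the trainee is cleared to score operational responses.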
Technologies are being used for adaptive testing, both at classroom and state levels. Strategies are being considered for employing state-level adaptive testing while meeting NCLB requirements for administering a common test to all students. One option under consideration is to administer a shorter core test of standards for NCLB reporting and augment the test with additional items adapted to student responses.
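The core-plus-augmentation option can be sketched as follows. This is a deliberately simplified illustration, assuming a hypothetical item format (a code and a difficulty value on an arbitrary scale) and a simple up/down staircase rule for updating the ability estimate. Operational adaptive tests typically rely on item response theory models rather than this rule, and nothing here reflects a specific state's design.

```python
def next_item(remaining, ability):
    """Pick the unused item whose difficulty is closest to the current estimate."""
    return min(remaining, key=lambda item: abs(item["difficulty"] - ability))


def administer(core, augment_pool, answer, n_augment=3, step=0.5):
    """Give every student the same core items, then adapt the added items.

    core         -- fixed list of items used for common (NCLB-style) reporting
    augment_pool -- extra items to draw from adaptively
    answer       -- callable returning True if the student answers an item correctly
    """
    ability = 0.0      # crude running estimate on the difficulty scale
    core_score = 0
    for item in core:  # common core: identical for all students
        correct = answer(item)
        core_score += int(correct)
        ability += step if correct else -step

    remaining = list(augment_pool)
    administered = []
    for _ in range(min(n_augment, len(remaining))):
        item = next_item(remaining, ability)
        remaining.remove(item)
        administered.append(item["code"])
        ability += step if answer(item) else -step

    return {"core_score": core_score,
            "augment_codes": administered,
            "ability": ability}


# Hypothetical items; difficulties are on an arbitrary illustrative scale.
CORE = [{"code": "C1", "difficulty": -1.0},
        {"code": "C2", "difficulty": 0.0},
        {"code": "C3", "difficulty": 1.5}]
AUGMENT = [{"code": "A1", "difficulty": -2.0},
           {"code": "A2", "difficulty": 0.4},
           {"code": "A3", "difficulty": 1.2},
           {"code": "A4", "difficulty": 2.5},
           {"code": "A5", "difficulty": 0.9}]

# Simulated student who answers correctly whenever difficulty is at most 1.0.
result = administer(CORE, AUGMENT, lambda item: item["difficulty"] <= 1.0)
```

Only `core_score` would feed the common accountability report in this sketch; the adaptively chosen items supply additional diagnostic information near the student's estimated level.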
Online delivery of science assessments holds considerable potential for efficiency and depth. A multi-level science assessment collaborative could share the costs and labor for piloting possibilities with commonly developed items and technology delivery systems.
Technology-based professional development. The state science models could take advantage of online strategies for promoting assessment literacy in general and professional development related to the science assessment system in particular. Online guides are currently in use to facilitate alignment of assessments (extant and under development) with science standards, support assessment development and adaptation, provide formal rater training and scoring or informal scoring practice, and guide test interpretation and instructional diagnosis.
The Web Alignment Tool in Box 20, for example, provides instruction and practice on alignment. The site guides the user through an “alignment literacy” overview and tutorial.
Box 20: WCER Alignment Tool


The Maine Assessment Portfolio (MAP) offers an online professional community for teachers to discuss tasks in the bank and practice scoring student work. Box 21 is from the MAP Web site.
Box 21: Maine Assessment Portfolio


The PALS Web site provides professional development both on rater training and scoring and on performance assessment in general (Box 22). The PALS Guide provides an overview of the components of performance assessment: targets, tasks, and rubrics. As shown in Box 22, information on each of the assessment components includes an introduction and key ideas, guidelines, common errors, and examples. References to additional resources and a glossary on assessment are included. The PALS Guide also presents examples of how tasks in the collection might be adapted for a classroom. The Guide illustrates potential revisions to a PALS task’s standards, the task administration procedures and student response forms, and the rubric (http://pals.sri.com).
Box 22: PALS Guide





The multi-level science assessment collaboratives would develop professional development guides and technology supports for them after taking stock of the approaches being used in member states and common needs of the collaborative. The collaborative might share development of common content and strategies related to science assessments, then incorporate the guidelines in state-specific Web sites or technology-based resources.
Reporting data. Getting assessment results into the hands of the people who need them, in a format that is accessible, is a continuing challenge in education. Educators often lack the skills necessary to interpret and use data productively. The technologies for distributing and displaying vast amounts of data have far outstripped the development of technologies and methodologies for converting the data into usable information. The key to success is the development of tools that allow users to drill down easily to the bottom line. The bottom line differs depending on the users’ needs. Teachers need to know where to go next instructionally with their students. They need to understand where their students are and where they need to go to achieve proficiency. School-based administrators need to know where their students are and what instructional programs they need to put in place. System and state administrators need to know where their schools are and what they need to do to support schools in meeting the accountability goals. A variety of Web-based systems have been developed for these purposes (Wayman & Yakimowski, 2004). While there are a variety of options, the more successful tools are organized around the schema of technical assistance, in which data presentations are conceptualized as an instructional opportunity and embedded in a staff development framework.
A number of on-line resources for building the capacity of school leaders to analyze and use data are currently available. States have invested in data analysis online technical assistance in three critical areas: helping educators access and analyze their data, showing educators what the state assessments test and what test items look like, and modeling for educators how student responses are scored.
Box 27 is an example of formative tools focused on helping teachers diagnose student strengths and needs with respect to achieving proficiency on state content standards.
Box 24: WINSS Successful School Guide

To address the multiple audiences for assessment data, Web-based guided dialogues with key decision makers enhance access to assessment data for these audiences.
Box 25: Maryland Data Analysis Tool

Maryland created an on-line course based on multi-media technology for principals to help them understand the instructional targets of Maryland’s accountability system and how to use assessment data to inform school improvement planning.
Box 26: Maryland On-line Staff Development Course

To help teachers understand and use formative assessment, Maryland is developing a Web-based tool set.

The multi-level science assessment collaborative could design common or state-specific on-line tools, based on commonly developed guidelines, to help multiple audiences report, interpret, and use the results of science assessments. The reports and guidelines would cover classroom-level assessments and district and program assessments as well as the state assessment.
Phasing in technology supports for science assessment. States in the collaborative will need to devise a strategy for using these technology supports. An analysis of each member state’s current infrastructure and short- and longer-term plans will enable the collaborative to set technology priorities and a staging process compatible with available resources and expertise. Some of the state members may be able and ready to pilot or implement some technology supports, so they can serve as test beds for the remaining collaborative members. For example, the collaborative may begin with modest technology supports such as using online tools for alignment and creating a searchable assessment collection. Technology supports for assessment design may begin with posting of specifications and emails or threaded discussions around draft tasks. Professional development may begin with Web sites of sample assessments, rubrics, and scored student work. Online delivery may be piloted in a small set of schools. An advantage of the collaborative will be that different members can take responsibility for leading designs and tryouts of technology supports. An articulated, multi-level state science assessment system is likely to develop more fully and quickly if available technologies are harnessed.
Summary
Under the assessment provisions of NCLB, a major challenge for state education agencies is to design and implement an assessment system that provides meaningful information to educators to inform instructional improvement at the program and classroom level and yields accountability information. This report describes components and features of multi-level science assessment models that rely on collaboration and technology to marshal the required resources. The multi-level, articulated system would build and draw upon collections or banks of items and tasks designed according to common specifications. State collaboratives would share resources to build the collections and draw from them to build separate or joint state science tests. The item and task pools would be aligned with separate state standards, yet represent joint efforts to build collections of item/task banks that address priority standards. Teachers and professional development teams would form the core of the development processes and ensure diagnostic utility of the collections.
The report describes two models that serve as exemplars, recognizing that other permutations of the components are possible. The two models describe how states can design instructionally coherent science assessment systems that address both instructional improvement and accountability requirements. The models aim to articulate and integrate science assessment across the classroom, district, and state levels of the system and could be built incrementally within the constraints of state policies and resources. The models are scalable to available state resources. Stages of model development may be influenced by: 1) the nature of collaboration with other states to leverage resources, 2) the breadth and sophistication of technology supports, and 3) the uses of the assessment results at different levels of the system.
Model 1: The State Coalition for Assessment of Learning Environments Using Technology (SCALE Tech) is one possible model for enhancing state science assessment systems. In the SCALE Tech model, states and districts would form a network committed to developing collections of science items and tasks that will elicit evidence of the full range of challenging science standards and to employing technology to support the development and use of such assessments throughout the levels of the educational system. State coalition members would collaborate on the design and development of science assessment item and task collections that can provide evidence of student progress on standards. States might target hard-to-measure inquiry standards such as designing investigations or communicating an evidence-based argument. For example, the coalition could elect to focus shared development on performance assessments that ask students to actually conduct investigations wherein they pose questions, design investigations, collect, analyze, interpret, and explain data and evidence to support conclusions, and communicate findings. Technologies could be phased in to support alignment of assessments with standards and curricula, development and use of collections of items and tasks, automated and teacher-based scoring of student work, online delivery of conventional items and tasks as well as simulations and uses of advanced technologies, and tailored reporting.
Model 2: The Classroom Focused Multi-Level Assessment Model describes an assessment system as an instructional tool for teachers and a decision-support tool for administrators that is transparently embedded in the day-to-day activities of classrooms. Teachers would use the assessment as a formative tool to guide differentiated instruction through an informed understanding of student strengths and weaknesses. Decision-makers would use the assessment results to help evaluate instructional strategies and programs. Administrators would use the assessment results to meet the accountability requirements of NCLB. The assessment system allows flexibility in creating modular assessments based on the needs of teachers and adapting them to meet the information requirements of decision-makers and administrators. Integrating these multiple purposes requires a system of assessment modules that: 1) are aligned with the content standards, 2) are organized in instructionally meaningful units, 3) can be delivered on demand, 4) provide timely diagnostic information for teachers and students, 5) allow measures to be combined over time and across levels, 6) support instructionally meaningful reporting, and 7) contribute to valid and reliable measures for accountability.
The report discusses the development of multi-level science assessment systems based on a large pool of formative and summative items readily available for use by classroom teachers, local assessment designers, and state assessment designers. A development process is proposed that includes the essential steps required to: identify priority standards and development needs for the collaborative, generate item specifications for item development, review items for content quality and sensitivity, develop diagnostic interpretations of student responses, and create test forms. To address the unique information needs of states, the report discusses processes for developing assessment blueprints that allow measures at each level to be combined for analysis and reporting.
Successful implementation of multi-level science assessment systems depends on a variety of technology supports using currently available technologies. The report discusses issues and provides examples of technology supports for science assessments, including: aligning assessments with standards and curricula; developing, accessing, and using science assessment collections; designing, delivering, and scoring assessments; delivering professional development; and reporting data. Special consideration is given to how technology supports for science assessment could be incrementally implemented.
References
American Association for the Advancement of Science (AAAS). (1993). Benchmarks for science literacy. NY: Oxford University Press.
Anderson, L.W. & Krathwohl, D.R. (Eds.). (2001). A taxonomy for learning, teaching, and assessing. NY: Longman.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA/APA/NCME). (2002). Standards for educational and psychological testing. Washington, DC:
Barth, J. (2004). West Virginia: Exploring State Collaborations for Assessment. NCLB Leadership Summit. Saint Louis, MO. March 2004.
Black, P. & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 81: 139-52.
Bransford, J. D., Brown, A. L., & Cocking, R. R. (2000). How people learn: Brain, mind, experience, and school. Washington, DC: National Academy Press.
Consortium for Policy Research in Education (CPRE). (1995). Helping teachers teach well: Transforming professional development. U.S. Department of Education. Retrieved April 4, 2002 from: http://www.ed.gov/pubs/CPRE/t61/
Frabrizio, L. (2004). Exploring State Collaborations for Assessment. NCLB Leadership Summit. Saint Louis, MO. March 2004.
Klein, S.P., Hamilton, L., McCaffrey, D.F., Stecher, B.M., Robyn, & Bugliari (2002). Teaching practices and student achievement: Report of the first-year findings from the “Mosaic” study of systemic initiatives in mathematics and science. Santa Monica, CA: RAND.
Lawrenz, F. & Huffman, (in press). The archipelago approach to mixed method evaluation. American Journal of Evaluation.
Means, B., Penuel, W. & Quellmalz, E. (2001). Developing assessments for tomorrow's classrooms. In W. Heinecke & I. Blasi (Eds.), Research Methods for Educational Technology. Vol. 1: Methods of Evaluating Educational Technology. Greenwich, CT: Information Age Press.
Mislevy, R. (2003). Overview of the Principled Assessment Design for Inquiry (PADI) project. Paper presented at the annual meeting of the American Educational Research Association, Chicago, Illinois.
National Science Education Standards (NSES). (1995). Washington DC: National Academy Press.
Neuberger, W. (2004). Online Assessment in Oregon: The Technology-Enhanced Student Assessment (TESA). NCLB Leadership Summit. Saint Louis, MO. March 2004.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds). (2001). Knowing What Students Know: The Science and Design of Educational Assessment. Washington, DC: National Academy Press.
Quellmalz, E.S. (2003). Performance Assessment Links in Science (PALS) Final Report. Menlo Park: SRI International.
Quellmalz, E. S. (1987). Developing reasoning skills. In J. R. Baron & R. J. Sternberg (Eds.), Teaching Thinking Skills: Theory and Practice. New York: Freeman.
Quellmalz, E. S. (1984). Designing writing assessments: Balancing fairness, utility, and cost. Educational Evaluation and Policy Analysis, 6, 63-72.
Quellmalz, E.S. & Kreikemeier, P. (2004). Testing the Alignment of Items to the NSES Inquiry Standards. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Quellmalz, E. S. & Zalles, D. (2002). Designing technology assessments cognitive-
Schafer, W. D. & Moody, M. (2003). Designing accountability assessments for teaching. National Council on Measurement in Education Convention, Chicago.
Shermis, M.D. (2004). Automated essay scoring. NCLB Leadership Summit. Saint Louis, MO. March 2004.
Shields, P.M., Marsh, J.A., & Adelman, N.E. (1998). Evaluation of NSF’s Statewide Systemic Initiatives (SSI) program: First year report. Menlo Park, CA: SRI International.
Schlager, M., Fusco, J., & Schank, P. (1998). TAPPED IN Web site. Retrieved April 17, 2002 from: http://tappedin.sri.com/
Stiggins, R. (2002). Assessment for learning. Education Week, 21(26), 30-33.
Wayman, J.C. & Yakimowski, M. (2004). Software to Facilitate Teacher Data Use and NCLB Reporting. NCLB Leadership Summit. Saint Louis, MO. March 2004.
Webb, N.L., Kane, J., Darwin, K., & Yang, J. (2001). Study of the impact of the statewide systemic initiatives program. Technical report to the National Science Foundation. Madison, WI: Wisconsin Center for Education Research.