The National Academies: Advisers to the Nation on Science, Engineering, and Medicine
NATIONAL ACADEMY OF SCIENCES NATIONAL ACADEMY OF ENGINEERING INSTITUTE OF MEDICINE NATIONAL RESEARCH COUNCIL

DATA FOR SCIENCE AND SOCIETY

The Second National Conference on Scientific and Technical Data

ABSTRACTS OF CONTRIBUTED PAPERS
(arranged by principal author in alphabetical order)

Please note: This site serves as a repository of contributed papers from the U.S. National Committee for CODATA's Second National Conference on Scientific and Technical Data: Data for Science and Society.

These abstracts and contributed papers and the information contained therein are not part of an official report of the National Research Council or the National Academies. The opinions and statements included in the abstracts and the papers are solely those of the individual authors and are not necessarily endorsed or verified as accurate by the National Academies.

_______________________________________________________________

Problems and Solutions in the Integration of Population Data with Other Disparate Data Sets
Deborah Balk (dbalk@ciesin.org) and Gregory G. Yetman, Center for International Earth Science Information Network (CIESIN), Columbia University

When creating databases that cross disciplines, units of analysis are often compromised. This paper examines three different approaches to data integration, each of which considers the problems of varying and seemingly incompatible analytic units. We highlight the following issues associated with building, maintaining, and using the database: federally and commercially dictated data restrictions, confidentiality, database documentation and metadata, foreign-language translation, and cross-national variable compatibility.

CIESIN uses three approaches to integrate population data with other disparate data sets, which differ in scope (national to global), scale (first- to fourth-level administrative boundaries), and thematic breadth (single variable to multivariate). These approaches include the creation of (1) a gridded global database of population (Gridded Population of the World), (2) a tool to visualize and export data across contiguous national boundaries (U.S.-Mexico Demographic Data Viewer), and (3) a tool to generate equivalencies between U.S. geographies (Geocorr).

All three approaches deal with data integration issues at the sub-national level; GPW and Geocorr also facilitate integration of data collected by administrative units with georeferenced biophysical data. The U.S.-Mexico DDViewer contains social, economic, and health behavioral data for three levels of boundaries. The approaches vary in the problems they address, but all are models highly applicable to other themes and scales.
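The abstract does not say how administrative-unit counts are converted to grid cells. One common technique for this kind of integration is areal weighting, in which each unit's population is split across grid cells in proportion to the area of overlap. The sketch below illustrates that idea only; the function names, the cell identifiers, and the overlap areas are hypothetical and are not taken from the Gridded Population of the World or Geocorr.

```python
def allocate_population(unit_population, overlap_areas):
    """Split one administrative unit's population across grid cells.

    overlap_areas maps a (hypothetical) grid-cell id to the area of the
    unit that falls inside that cell. Shares are proportional to area.
    """
    total_area = sum(overlap_areas.values())
    if total_area <= 0:
        raise ValueError("overlap areas must sum to a positive value")
    return {
        cell: unit_population * area / total_area
        for cell, area in overlap_areas.items()
    }


def grid_population(units):
    """Sum the allocated shares of many units into one grid.

    units is a list of (unit_population, overlap_areas) pairs.
    """
    grid = {}
    for unit_population, overlap_areas in units:
        shares = allocate_population(unit_population, overlap_areas)
        for cell, people in shares.items():
            grid[cell] = grid.get(cell, 0.0) + people
    return grid
```

Areal weighting assumes population is spread evenly within each unit, which is why finer administrative boundaries generally give a more faithful grid.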

_______________________________________________________________

A Framework for Science Data Access Using XML
Daniel Crichton (daniel.crichton@jpl.nasa.gov), J. Steven Hughes, Jason Hyon, and Sean Kelly, Jet Propulsion Laboratory

Science missions and instruments continue to produce large volumes of useful data, and scientists depend on the data systems and tools that archive these data as a means to access and analyze them. These existing legacy systems do not interoperate well, and scientists must access each data system and its corresponding science data independently through tools that have been custom-built for the particular science data system or mission. The Object Oriented Data Technology task is working on the Distributed Resource Location Service, which will allow geographically distributed data to be located and exchanged. Advances in Internet and distributed object technologies provide an excellent framework for sharing data across multiple data systems. The Extensible Markup Language (XML) and the Common Object Request Broker Architecture (CORBA) provide support for electronic data interchange (EDI) between heterogeneous data sources. CORBA provides the over-the-wire exchange of XML-based profiles that contain descriptive information about science products archived at remote data systems. This paper will discuss a framework for data system interoperability that will not only benefit space science, but also provide a cross-disciplinary solution for a next-generation data system architecture.

______________________________________________________________

High Altitude Observatory Data Service: Experience in Interdisciplinary Data Delivery
Peter Fox (pfox@ucar.edu), Jose Garcia, and Patrick Kellogg, National Center for Atmospheric Research (NCAR)

The High Altitude Observatory (HAO) division of NCAR investigates the sun and the earth's space environment, focusing on the physical processes that govern the sun, the interplanetary environment, and the earth's upper atmosphere. HAO is a focal point for two important programs: (1) the Coupling, Energetics and Dynamics of Atmospheric Regions (CEDAR) program designed to enhance the capability of ground-based instruments to measure the upper atmosphere and to coordinate instrument and model data for the benefit of the scientific community, and (2) the Radiative Inputs from Sun to Earth program designed to address causes of variations in the Sun's radiation as a star as well as the source of radiant energy at the Earth.

In this paper, we detail an effort over the past two years at HAO to provide uniform access to data services. The underlying technology uses common application programs, such as the Interactive Data Language, the Web, and the Distributed Oceanographic Data System (DODS). We will present the design, implementation, and support of each component, including end-user search and access via the Web and applications, data transmission and subsetting, and data format support on servers. New support was added to DODS for the CEDAR database format and for the Flexible Image Transport System. Since DODS uses URLs to locate data, several server-side functions were designed and implemented to simplify the URL syntax. Close attention is paid to evaluating the productivity and performance of each part of the system. As a result, a new implementation of the DODS server architecture was developed using the Apache web server API, allowing significant performance improvements in delivering large data sets.

_______________________________________________________________

Increasing Access to Distributed, Multidisciplinary Data through Application of a Biological Species Names Thesaurus
Michael Frame (mike.frame@usgs.gov) and Michael Ruggiero, U.S. Geological Survey

Much of the scientific and technical data relating to the natural world include references to the scientific and/or vernacular names of the species or higher taxonomic groups that are represented in the data. Biological names are thus the common denominator that can be used to link data from many distributed sources and across disciplines, from molecular biology and genetics to entire ecosystem-level studies. The U.S. National Biological Information Infrastructure program is developing and implementing an innovative approach to using biological names to increase access to multidisciplinary environmental data. This system integrates a scientifically credible and dynamic biological names thesaurus with a suite of web-based indexing and searching tools (including metadata content standards for data set documentation, metatags for web page documentation, and specialized search engine configurations). The relative effectiveness of this approach versus more conventional strategies is demonstrated. The system relies on a partnership between biological scientists involved in building and maintaining the names thesaurus and information scientists interested in using these names to fuel data discovery and access tools. The advantages to both groups in working together on solutions are described.

_______________________________________________________________

Diverse Geospatial Information Integration
Daniel Gordon (dgordon@autometric.com), Alfred Powell, and Phillip Zuzolo, Autometric, Inc.

This paper addresses issues in the integration and display of disparate data sets for science and technology users. The advantages of multi-source, multi-resolution visual fusion will be discussed and demonstrated. Specific examples will be used to demonstrate key advantages of visual fusion using information available from NOAA and other government sources. Such applications are especially important for understanding technically challenging problems and will assist the environmental community in educating the public and scientists about key environmental issues.

______________________________________________________________

A Data System to Integrate Data from Landscapes, Streams, and Estuaries for Determining the Condition of Estuaries on the U.S. Mid-Atlantic Coast
Stephen S. Hale (hale.stephen@epa.gov ) and John F. Paul, U.S. Environmental Protection Agency

Estuaries are natural integrators of substances and processes that occur internally and externally (watersheds, ocean, atmosphere). Watershed activities that contribute fresh water, nutrients, contaminants, and suspended solids have a strong effect on the health of estuaries. Researchers trying to understand the condition of estuaries must do a similar integration, using data from many scientific disciplines. Because these data come from numerous databases, operated by different organizations in various formats, it is often a challenge to find and integrate them. The Mid-Atlantic Integrated Assessment (MAIA), a pilot for projects of the Committee on Environment and Natural Resources, gathered data from many sources in the U.S. mid-Atlantic coastal region and integrated them with MAIA data from landscapes, streams, and estuaries. The purpose was to assess current conditions and to establish a data system that will support continuing assessments. Problems in finding and using data from diverse sources were approached with a variety of data management tools including data directories, inventories of monitoring programs, analytical databases, GIS, data clearinghouses, and data warehouses. Encouraging data owners to move toward common standards, directories, and data descriptions for databases with distributed ownership has made it easier to find, download, understand, and integrate data.

_______________________________________________________________

Practical Challenges in Creating an Integrated National-Level Environmental Data Set: Lessons Learned from the Environmental Sustainability Index
Marc Levy, Jessie Cherry, Alex de Sherbinin, and Francesca Pozzi, Center for International Earth Science Information Network (CIESIN), Columbia University

There is a growing demand for data about environmental sustainability that cover most of the world's countries, are comparable across those countries, and integrate across physical, biological, and socioeconomic domains. Generating such data poses severe practical challenges. Many variables are missing data for many countries. Some data are point-based; others are available only in gridded or vector format. Often, data are based on voluntary reporting. In many cases a variable desired on conceptual grounds is simply not available.

This paper reports on lessons learned through the creation of the Environmental Sustainability Index, a prototype commissioned by the World Economic Forum in conjunction with the Yale Center for Environmental Law and Policy. A variety of strategies were employed to try to overcome these challenges, including:

developing a non-arbitrary way to limit the number of countries so as to reduce missing-data problems;

using survey data to augment physical measurements;

developing weighting schemes to convert point-based measurements to national aggregates;

using GIS methods to aggregate gridded and vector data;

creating numerical data series from textual reports and other non-quantitative sources; and

developing a heuristic structural model of sustainability to permit meaningful integration across various categories of variables.
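To make the weighting strategy in the list above concrete, here is a minimal sketch of a weighted mean that turns point-based measurements into a single national value. The abstract does not describe the actual weighting schemes used for the Index, so the choice of weights (for example, the population or area each monitoring point is taken to represent) and the function name are assumptions for illustration.

```python
def national_aggregate(measurements):
    """Combine point measurements into one national value.

    measurements is a list of (value, weight) pairs. The weight is a
    hypothetical measure of how much of the country each point stands
    for, such as nearby population or represented area.
    """
    total_weight = sum(weight for _, weight in measurements)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(value * weight for value, weight in measurements)
    return weighted_sum / total_weight
```

With equal weights this reduces to a simple average, so the weights matter most when monitoring points are unevenly distributed across a country.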

In the development of the Index, it is critical to balance the immediate need for usable policy information with the longer term potential for more accurate and complete data on sustainability.

_______________________________________________________________

Advanced Visualization of Scientific Metadata
Ion Mateescu (imatees@ciesin.columbia.edu), Center for International Earth Science Information Network (CIESIN), Columbia University
Lucy Nowell and Leigh Williams, Pacific Northwest National Laboratory
Karen Moe, National Aeronautics and Space Administration

As more and more metadata catalogs become available on line, researchers face the challenge of information overload. Many search tools such as specialized thesauri and more sophisticated query interfaces help users narrow their search when users know very specifically what they want. However, in many cases, users are not familiar with what data are potentially available and how different types of data may relate to each other. They may need assistance in exploring the contents of one or more distributed data catalogs in order to better understand the universe of potentially relevant data sets and interrelationships among them.

To help alleviate this problem and to increase efficiency in dealing with large metadata collections, the Center for International Earth Science Information Network (CIESIN) at Columbia University is applying research on text visualization to the world of scientific data catalogs. Researchers at Pacific Northwest National Laboratory and CIESIN have teamed up to apply the search, visualization, and analysis capabilities of WebTheme to collections of metadata documents retrieved using the Z39.50 protocol. This paper describes some of the difficulties and opportunities encountered in this prototype project.

_______________________________________________________________

The National Climatic Data Center’s Policy on the Quality Assurance of Daily Temperature Observations
Matthew J. Menne (mmenne@ncdc.noaa.gov) and Michael Crowe, National Climatic Data Center

The National Climatic Data Center (NCDC) has used a variety of quality control/quality assurance techniques to detect errors in temperature and other variables as data are operationally ingested and processed prior to archival. As part of a recent initiative to monitor the "health" of NOAA’s observational networks, new quality assurance methods recently have been developed and added to the existing suite of data processing algorithms in order to improve timely error detection in temperature data from two of these networks: the National Weather Service Cooperative Observer Network and the Automated Surface Observation System. While enhanced error detection of observations from these and other NOAA observational platforms will benefit the scientific community in assessments of climate and global change, the quality assurance of weather and climate data has received increased attention with the addition of a growing private sector constituency--weather risk management. In fact, the user needs of this major new constituency have provided the incentive for the NCDC to revisit issues such as the impact of evolving quality assurance methods on historic archives and its policy on supplying substitute values for missing values and for observations flagged as potential errors. This paper will report on progress towards the formulation of a consistent policy regarding these issues.

_______________________________________________________________

Integrating Data from the Internet: Are Metadata All We Need?
R.J. Olson (rjo@ornl.gov) and R.A. McCord, Oak Ridge National Laboratory*

Ecologists are mining data from the Internet in addition to collecting field measurements to address today's questions about the impacts of man's activities on regional and global processes. The Internet provides information on aspects of the environment at broader spatial and temporal perspectives than an ecologist could easily acquire from their own fieldwork. Despite the ever-increasing computer power and freely available databases, integrating data from multiple sources is difficult. A major barrier is having adequate metadata; however, even with well-documented data from fully functional archives, these data often cannot be easily combined to produce credible results. When data are combined, new problems and opportunities arise. For example, in a project to evaluate the performance of regional ecosystem models, the model outputs were used to identify unreasonable combinations of climate, land cover, and productivity data that had been compiled from separate sources. A major effort was invested in cleaning up these data for the analysis, including evaluating the outliers by a diverse group of scientists. The new integrated data set was a significant product in itself. Based on our experience with this and other projects, our poster will illustrate pitfalls and ways to take advantage of combining information from multiple sources.

*Oak Ridge National Laboratory is managed by Lockheed Martin Energy Research Corp., under contract DE-AC05-96OR22464 with the U.S. Department of Energy. The U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

_______________________________________________________________

Data for Assessment of Irrigated Soils Degradation and Management of Long-term Soil Preservation
V. Prikhodko* (prikhodko@ucdavis.edu), M.J. Singer,# E. Manakhova+
*Visiting researcher, Department of Land, Air and Water Resources, University of California at Davis
#Department of Land, Air and Water Resources, University of California at Davis
+Post-graduate student, Moscow University, Russia

Irrigation is one of the most powerful anthropogenic factors influencing soils, water, and the environment. Seventy percent of water utilized by humanity is used for irrigation. Irrigation provides favorable conditions for plants, but frequently results in soil deterioration. Degradation can vary from slight, meaning little change in soil functioning in the biosphere and < 10% crop yield reduction, to very severe, in which soil functions are severely affected and crop yields are decreased by 75-90%. Both static and dynamic soil properties are affected. Soil degradation may be evaluated using one parameter or many parameters simultaneously. Early diagnosis of negative transformations caused by irrigation is facilitated by the study of micro- and meso-scale soil properties. Reversible processes (e.g., changes in humus content, water-stable aggregates, gypsum, other salts, and nutrients) and irreversible processes (e.g., microstructure destruction, reduction of total and reactive organic content, and clay loss) are considered. Quantitative parameters that characterize favorable conditions of irrigated soils, and measures of the degree of degradation, will be presented. Examples of irrigation effects on soil carbon, soil structure, and porosity from the University of California Long Term Research in Agricultural Systems (LTRAS) plots will be presented to illustrate kinds of management effects.

_______________________________________________________________

Better Access Through Federal and Private Sector Data Integration

Hedy J. Rossmeissl (hjrossmeissl@usgs.gov), U.S. Geological Survey

As an agency of the Federal government, the U.S. Geological Survey (USGS) makes available a wide range of earth and biological science data in the public domain. Tools are being developed to access these multiple data sets in an integrated manner. The Earth Explorer software will allow cross inventory searches; the National Atlas is an integrated database of small-scale national data sets produced by numerous Federal agencies.

In addition, the USGS constantly examines its policies and actions to ensure that its activities are inherently governmental and that the private sector is being encouraged and enabled to reprocess and serve these data in formats, combinations, and products that meet the needs of their customers. Successful ventures with the private sector have been through business partner agreements, of which the USGS has several dozen relating to its digital data, and cooperative research and development agreements (CRADAs). For example, the TerraServer was developed under a CRADA with Microsoft. Lexon Technologies is developing consumer products from the National Atlas data. The USGS has an agreement with Pictometry to develop an image product that incorporates USGS digital data. The private sector is also being encouraged to value-add satellite data. These policies provide the public with far wider access to USGS data in ways that better meet their requirements.

_______________________________________________________________

A Case Study of Environmental Research Data Management

Trent G. Schade (schade.trent@epa.gov), U.S. Environmental Protection Agency

In order to support our ongoing research in watershed ecology and global climate change, we gather and analyze environmental data from several government agencies. This case study demonstrates a researcher's approach to accessing, organizing, and using intersectoral data. The research topic is an assessment of the potential impact of global climate change on engineered environmental systems.

Data providers include government agencies, commercial contractors, and professional organizations. For example, NOAA provides precipitation data; USGS provides streamflow data and channel geometry; EPA provides point-source permit data; and cost data are provided by EPA as well as professional organizations.

Each group has a unique access requirement and a unique data format. The researcher must shoehorn these data into a format appropriate for the model or method applicable to the problem. For this topic, we require the following models:

Water Quality Models - USGS and EPA

Climate Change Models - EPA

Statistical Models - Commercial Software

Synthesizing and analyzing the results from the models is another data management task. In the end, a key goal of the data management is to ease technology transfer; our research should adapt to the efforts of technology-limited data users such as local watershed groups. Our metrics for meeting this goal are economy and efficiency.

_______________________________________________________________

Providing a Common Search and Data Usage Facility for Independent Space Physics Data Centres

James Thieman (thieman@nssdc.gsfc.nasa.gov) and Edward Bell, National Space Science Data Center

Michael A. Hapgood, CLRC Rutherford Appleton Laboratory

Christopher C. Harvey, Centre de Données de la Physique des Plasmas, CNRS/CESR

J. David Winningham, Southwest Research Institute

The space physics community has assembled large and diverse data holdings--catalogues, databases, simulations, and digital data archives from both space- and ground-based facilities--most of which are available on-line in a wide variety of formats via network services. The goal of this initiative is to find a way to link these data resources so that space physicists can easily locate and use data of interest regardless of which of the many facilities actually holds the data. Any approach to providing this facility must impose minimal impact on data providers, utilize existing network access tools, and require little or no addition to project budgets for implementation. One approach to creating this facility is to model it after the Astrobrowse system presently used for data finding by the astronomy community. Similarly, it is proposed that the acquisition and intercomparison of data be done in a manner similar to the current Distributed Oceanographic Data System (DODS). There are many lessons to be learned from the Astrobrowse and DODS approaches that can be applied to the space physics problem. A prototype implementation planned for year 2000 will be described.

_______________________________________________________________

The DOE ARM Program Data Management System

Joyce Tichler (tichler@bnl.gov), Wanda Ferrell, Raymond McCord, and Jimmy Voyles, U.S. Department of Energy

The Atmospheric Radiation Measurement (ARM) Program is the largest global change research program within the Department of Energy (DOE). The program investigates the role of clouds in climate models, a critical scientific issue identified by the United States Global Change Research Program. The ARM Program established and operates field research sites in three climatically significant locations. Scientists collect and analyze data obtained over extended periods of time from large arrays of instruments to study the effects and interactions of sunlight, radiant energy, and clouds on temperatures, weather, and climate.

A team of scientists and computational scientists from DOE national laboratories designed and built the ARM data management system. Data and accompanying metadata are collected at the three sites and from auxiliary sources, converted to a self-describing data format (netCDF or HDF) and transferred to the ARM Archive for distribution to ARM scientists and to the general scientific community.

The early years of the data management effort were dominated by implementation of the infrastructure necessary to collect, format and archive the data; more recently dominant issues have turned to ensuring the quality and usefulness of ARM data. This paper will document the current status of the ARM data system and discuss future plans.

_______________________________________________________________

Calculational Data Base for Unimolecular Processes

Wing Tsang (wing.tsang@nist.gov) and Vladimir Mokrushin, National Institute of Standards and Technology

Chemical kinetic and thermodynamic data represent essential inputs for the simulation of real-world processes. Such data have traditionally been presented in the form of extensive tables or equations. The variety of conditions under which such data are applied makes the traditional approach limiting. We describe an alternative approach in which theory and modern computational capabilities are combined to generate data as they are required. The application is for unimolecular reactions. The approach has general applicability and is in line with modern capabilities in science and technology. The results are implemented in a recently completed, user-friendly Windows program. Input data are the properties of the molecules and transition states and their interactions with the bath molecules. Rate constants for Boltzmann systems are then calculated on the basis of the partition functions. RRKM theory is used to derive specific rate constants as a function of energy and angular momentum. When combined with parameters that describe energy transfer processes, rate constants for all conditions are derived through the solution of the time-dependent master equation. The database is reduced to the input parameters, can be applied to a practically infinite variety of conditions, and is easily updated.
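As an illustration of calculating a rate constant from partition functions, the sketch below evaluates the standard canonical transition state theory expression, k(T) = (k_B T / h) (Q_ts / Q_r) exp(-E0 / (R T)). This is a textbook form chosen for illustration; the abstract does not state which expression the program uses, and the function and parameter names here are hypothetical. The RRKM and master-equation steps are not shown.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J s
R = 8.314462618      # molar gas constant, J/(mol K)


def tst_rate_constant(temperature_k, q_transition_state, q_reactant,
                      barrier_j_per_mol):
    """Canonical transition state theory rate constant in 1/s.

    q_transition_state and q_reactant are dimensionless partition
    functions (the transition state value excludes the reaction
    coordinate); barrier_j_per_mol is the barrier height in J/mol.
    """
    if temperature_k <= 0:
        raise ValueError("temperature must be positive")
    prefactor = K_B * temperature_k / H
    ratio = q_transition_state / q_reactant
    boltzmann = math.exp(-barrier_j_per_mol / (R * temperature_k))
    return prefactor * ratio * boltzmann
```

Because only molecular properties enter the calculation, the same inputs can be reused at any temperature, which is the sense in which a database can be reduced to its input parameters.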

_______________________________________________________________

Mercury: Managing Distributed Multidisciplinary Scientific Data*

L. D. Voorhees (ldv@ornl.gov), P. Kanciruk, B. T. Rhyne, S. E. Attenberger, Oak Ridge National Laboratory+

Large-scale multidisciplinary field investigations typically involve many investigators and can result in thousands of data files. Often, sharing data from these investigations among researchers throughout the world does not occur for several years after the study has been completed, in part because of the effort required to document, organize, and present highly diverse, distributed data. In addition, traditional centralized data systems for storing and searching metadata are time-consuming and costly to develop, require significant resources to operate, and are frequently out of date. We have developed a modern Web-based system, Mercury (http://mercury.ornl.gov), which assists investigators in documenting data and allows them to maintain control of their data and metadata. Mercury uses a combination of commercial off-the-shelf (COTS) software, custom software, and metadata standards to provide an economical, dynamic, and rapidly deployable system. Using XML, metadata are coded into HTML documentation files on an investigator’s server, which are periodically harvested by an HTTP retrieval program. The results are used to automatically build a searchable index of metadata. A user of this system searches this index through a Web-based interface, which provides links back to the documentation and data files located on investigators’ servers. This new way of sharing data and information among researchers throughout the world greatly facilitates the research process and can be applied to many kinds of projects, regardless of discipline.
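The harvest-and-index pattern described above can be sketched in a few lines. This is not Mercury's code: the abstract does not give the metadata encoding, so the sketch assumes, purely for illustration, that metadata appear as `<meta name="..." content="...">` tags, and it takes already-retrieved pages as input instead of fetching them over HTTP. All names and URLs are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.metadata[name.lower()] = content


def build_index(pages):
    """Map each lowercased metadata word to the URLs that contain it.

    pages maps a URL to the HTML text harvested from that URL.
    """
    index = defaultdict(set)
    for url, html_text in pages.items():
        parser = MetaTagParser()
        parser.feed(html_text)
        for content in parser.metadata.values():
            for word in content.lower().split():
                index[word].add(url)
    return dict(index)


def search(index, term):
    """Return the sorted URLs whose metadata contain the term."""
    return sorted(index.get(term.lower(), set()))
```

The index stores only links, so the documentation and data files stay on the investigators' own servers, matching the distributed design the abstract describes.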

*This work is sponsored by the National Aeronautics and Space Administration.

+Oak Ridge National Laboratory is managed by Lockheed Martin Energy Research Corporation for the U.S. Department of Energy under contract DE-AC05-96OR22464.

_______________________________________________________________

Challenges in Managing Model-Generated Data: Supporting an Open International Scientific Assessment Process

Xiaoshi Xing (xing@ciesin.org) and Robert S. Chen, Center for International Earth Science Information Network (CIESIN), Columbia University

Data generated by computer-based simulation models have not generally received the same level of attention as observational data from a data management perspective. However, computer models in areas such as global climate change are increasingly being used in interdisciplinary research and assessment efforts and in national and international policy discussions. Intercomparison of different models and results, often developed by different research groups around the world, is vital especially in a timeframe that is difficult for traditional processes of scientific review and publication to accommodate.

Working closely with Working Group III of the Intergovernmental Panel on Climate Change (IPCC), CIESIN developed an online World Wide Web (WWW) site to support an "open process" of international scientific review and exchange for the Special Report on Emission Scenarios (SRES). This WWW site provided interactive access to a variety of scenarios and supporting materials developed by the Working Group and a means for the international scientific community to submit comments and new scenarios for Working Group consideration. The collaborative effort also gave CIESIN a unique opportunity to deal with the archiving and documentation needs of an unusual set of model-generated data. This paper describes key lessons learned in supporting the IPCC Open Process and in managing complex model-based data sets.
