|
Statistical Issues in Data Acquisition: Workshop Summary
In today’s information age, scientists rely chiefly on statistical modeling and analysis to manage massive amounts of data. Data acquisition is one area where these techniques can help scientists capture and parse data. Specifically, researchers often ask, “How can data be collected in order to discover unanticipated information?” To address this and other questions involving data acquisition, the Committee on Applied and Theoretical Statistics (CATS) of the National Research Council held a day-long workshop at Lawrence Berkeley National Laboratory (LBL) on July 16, 2004. Statisticians and scientists from fields such as high-energy physics, earth science, and high-performance computing discussed statistical techniques and methodologies from their research and highlighted current problems and solutions.
Robert Jacobsen, a high-energy physicist at LBL, opened the workshop with a presentation on the standard model of particle physics. In this model, physical events are built from chains of particle interactions, and these chains are reconstructed backwards through statistical inference. However, Dr. Jacobsen stated that most physicists are not trained in uncertainty analysis, and therefore “observations that fit don’t always make the cut because throwing them out lowers uncertainty.”
A number of statistical challenges arise because “data collection” and “data processing” are not clearly distinguished. Julian Borrill, an LBL cosmologist, explained that while many large data-collecting organizations, such as NASA and NSF’s ground-based facilities, publish their data for scientific use, they generally do not specify what refinements, if any, have been made to the raw data. On the one hand, scientists may unintentionally use data that have been biased by certain types of pre-processing; on the other hand, they may find that unprocessed data sets are too large or too noisy to be useful. Wendy Meiring, a statistician at the University of California at Santa Barbara, raised a similar concern. In her work with ozone data, there is debate over whether to revise and re-release data to correct for deficiencies in the data collection instruments that were discovered years later.
Many workshop participants expressed the need for data owners to work with statisticians during the data gathering process. Amy Braverman, often the only statistician among her collaborators at NASA’s Jet Propulsion Laboratory, suggested that the simulation and modeling community often holds the misconception that statisticians work entirely on error estimates, and does not see them as experts in variation. One workshop participant commented that one reason for this lack of communication is that statisticians tend to define themselves by methodology (Bayesian, for example) rather than by data type (climate data, for example). Others agreed that organizing themselves by application area would help statisticians bridge the gap between the statistics and non-statistics communities.
Two workshop attendees discussed high-performance computing tools currently available for data acquisition. George Ostrouchov, a computer scientist at Oak Ridge National Laboratory, gave an overview of DOE’s Scientific Discovery through Advanced Computing (SciDAC) program. This program aims to develop the software and hardware needed for terascale computers to run PDE-based finite element simulations. Jogesh Babu, a statistician at Penn State University, introduced the workshop participants to www.vostat.org, a web-based service providing a suite of statistical tools designed for astronomers working with large data sets. The VOStat project is a joint effort led by Penn State University in collaboration with Carnegie Mellon University and the California Institute of Technology.
Overall, the participants found the workshop discussions useful, and a number of them discussed the possibility of future collaborations.
|