USNC/CODATA Discussions on S&T Data Management and Policy

 

International and Domestic Data Issues Discussion:

Scientific and Technical Data Needs and Opportunities

April 9, 1999

TRANSCRIPT

INTRODUCTORY REMARKS

DR. OERTEL: Good morning and welcome to the U.S. National Committee for CODATA’s International and Domestic Issues Discussion on Scientific and Technical Data Needs and Opportunities. I am Goetz Oertel, chair of the U.S. National Committee (USNC) for CODATA. I wanted to first of all welcome the people who are here from the various federal scientific agencies, including: Peter Weiss, Office of Management and Budget (OMB); Alan Gaines, National Science Foundation (NSF); Gerald Barton, National Oceanic and Atmospheric Administration (NOAA); Hedy Rossmeissl, U.S. Geological Survey (USGS); Robert Shepanek, Environmental Protection Agency (EPA); Wanda Ferrell, Department of Energy (DOE); Judy Vaitukaitis and Richard Dubois, National Center for Research Resources at National Institutes of Health (NIH); Elliot Siegel, National Library of Medicine (NLM) at NIH; Pamela Andre, National Agricultural Library (NAL) at U.S. Department of Agriculture (USDA); Joe Bredekamp and Lola Olsen, National Aeronautics and Space Administration (NASA); and Kurt Molholm, Defense Technical Information Center (DTIC).1

Before we begin, let me call on Paul Uhlir, director of the USNC/CODATA, who will give a thumbnail sketch of CODATA.

MR. UHLIR: Thank you all for coming. I think most of you are fairly familiar with the work of the National Research Council and with this committee, but for those of you who aren't, let me give you a quick background. This forum was convened by the U.S. National Committee for the Committee on Data for Science and Technology (CODATA). CODATA is an interdisciplinary committee of the International Council for Science (ICSU), which is a non-governmental organization. The U.S. National Committee for CODATA is the U.S. member body to that group. There are 31 other scientific unions and interdisciplinary committees affiliated with ICSU, and the National Research Council administers all the U.S. committees that adhere to ICSU. Many of the U.S. national committees actually are not the typical kind of NRC committees that conduct studies and produce reports and do other kinds of substantive work. They really serve more of a liaison function between the U.S. scientific community and the international ICSU unions. They provide support for those unions and send delegates to the international meetings. In the case of the USNC/CODATA and some of the other interdisciplinary committees, such as the International Geosphere-Biosphere Committee and other interdisciplinary environmental groups, they actually do conduct studies or hold workshops and conferences under the National Research Council auspices, in addition to the international liaison kinds of activities.

I believe all of you have received a brief one-pager along with your letter of invitation, which gave a summary of CODATA and the U.S. National Committee, and listed some of the major activities that the U.S. National Committee has undertaken since 1991, when I took over the direction of the committee. I am not going to go over that material, which noted several studies and a national conference on scientific and technical data exchange and integration. We are now planning another data conference for March 2000 with our sponsors and we also are planning to hold an international CODATA conference and general assembly in 2002. In addition, we are planning for next year bilateral meetings with the Chinese National Committee to CODATA, which would be held in conjunction with our data conference next March and then in China later in the year. So, those are three activities that we are planning now. However, we are also interested in hearing your views on the important issues that federal data managers are facing on an interdisciplinary data management and policy basis.

Unless there are any questions, I will turn the discussion back over to Goetz Oertel and then to John Rumble, President of CODATA, to talk about the international CODATA activities, which are distinct from the U.S. National Committee activities, but which we, of course, help coordinate and support.

DR. OERTEL: I will continue with this very brief introduction. I just want to tell you that the USNC/CODATA is here to communicate with the federal data managers. We are here to hear from you and to listen to your concerns and issues that you may have. We have asked the invited data managers to present their top two or three concerns, and it will be extremely interesting to see how that works out when we compare notes with the different agency speakers.

Of course, we want to see how CODATA may be able to help on the issues and concerns that you are facing. The time is auspicious because the creation, transmission, and use of data continue to increase at an exponential rate. Sea changes are taking place in the way we do research, and in the legal and regulatory environment and the business environment. More of that is to be expected as this change continues. It is a very exciting time in this area. It is also a time when we are probably in a position to be more effective than ever before in CODATA because CODATA has just elected John Rumble, chief of the Standard Reference Data Division at the National Institute of Standards and Technology, as the new president of CODATA. John is committed to making CODATA a force to be reckoned with and to be active and to influence things on an international level in our common interest, in the interest that we share with scientists in other countries.

So, I promise that we will pay close attention to what you say and we will give you feedback. We are going to have your presentations recorded by a transcription service, in order to enable staff to put together a complete and concise summary of the presentations. We certainly don't want to lose any of the good points that you will be making, which could happen without a formal transcript.

Let me also say that a general discussion will be very welcome. We, therefore, would like you to stay within the allotted time of five to ten minutes. I will give a first indicator to speakers after about six minutes and a second after eight and the hook after ten in the hope that we will have some time left over for discussion of some of the key issues.

I would now like to turn the discussion over to John Rumble, who will tell you what international CODATA is up to.

INTERNATIONAL CODATA ACTIVITIES2

DR. RUMBLE: I want to say a few words about CODATA and then a few words about why we had this session. CODATA, as Goetz and Paul both pointed out, is an interdisciplinary committee, conducted under the auspices of the International Council for Science. It is about 30 years old and its interest is in improving the quality, reliability, management, and accessibility of data of importance in all fields of science and technology. "Data" are not defined as the full range of scientific and technical information, not full text, not effective indexing, but quantitative data that are important in science and engineering.

CODATA really is a group of people, who have a great deal of expertise in working with data in all these different scientific areas. This is meant to be inclusive, not exclusive; there are other areas, including the social sciences, that actually handle quantitative and numeric data and CODATA is interested in them all. We have participation from virtually every area of science of people who are very interested in doing things with data, whether just collecting them for two or three years or analyzing them, manipulating them, providing access to them—everything that you do with numeric data to make them useful to scientists and engineers throughout the world.

CODATA’s objectives are really recapitulated on the last two slides (see Appendix B). We are focused on improving the use, improving the accessibility, and improving the quality of data. How you actually do that is by fostering the international cooperation among people throughout the world who are interested in scientific and technical data and also by promoting the importance of the use and collection and access to data. That is really how CODATA accomplishes our goals.

CODATA provides an umbrella organization for international or multinational data projects to work under. We work on standards, either formal or informal, for collecting, comparing, and exchanging data. The buzzwords, of course, today are data integration and data exchange. We traditionally have provided information on directories to data, on building directories. We also have done some education and training and that takes place both in the form of workshops and also special outreach courses. In particular, we hold a biennial conference. Our last one was convened in November 1998 in New Delhi and the next one will be in Italy in the fall of 2000. These conferences provide, I think, a remarkably vibrant forum for bringing together people who are interested in the management of scientific data to share ideas, to meet new people, to start developing bilateral relationships, which lead to very useful and important data projects, and, finally, to prepare key data sets, such as tables of fundamental constants.

There are 23 member countries plus the Academy at Taiwan, who are the primary members of CODATA and membership is on the basis of countries. Each member country or the academy has a national committee similar to the U.S. National Committee, which provides a focus for CODATA activities within that country and also it is a source of talent to help administer the program. CODATA also has many liaisons with the international scientific unions, such as the International Union of Pure and Applied Chemistry, International Union of Biological Sciences, and so on and so forth. And we have some supporting organizations and we have task groups, which I will mention in a minute.

Now, why is CODATA becoming more and more important now? Why have we tried to gather you here today to talk about your data needs and what CODATA can do to help address them? That is because, of course, of this incredible information revolution that is going on in science and technology. First is the Internet. We don't have to say anything more about it, but it is just totally changing the way we do the science. The Internet is facilitating international collaboration. Finding data and information is much easier than ever before through this medium, though we don't always know the quality of that information.

The second factor is that, when I started in this business 25 years ago, the first thing I had to do was write my own database management systems, and it is a task that I can assure you you don't really ever want to do. Today, anyone can get access to computer tools that are powerful, robust, and operate in all sorts of environments just simply by buying commercial software packages. So, the ability for a scientist or an engineer or a group of scientists or engineers to collect data and manipulate them using modern information technology is not trivial, but it is so much easier than it ever has been. That means that researchers, who are constantly looking for better ways of doing their work, now do have better ways of doing that.

The third factor, and perhaps even more important to me, is the realization that with modern computing, networking, mathematics, and physical theory we are really able to do the modeling and simulation of extremely complex phenomena and to get very usable scientific and engineering predictions. But, furthermore, we also realize that for modeling and simulation really to work, you need good data. Garbage in, garbage out is probably never more relevant than today.

Obviously, this revolution is making it easier to do data work and it is easier for data work to begin to get into R&D and that is a really significant point, but a couple of things are happening. One is that as people in individual disciplines learn the lessons about collecting, manipulating, using, and providing access to data, they are learning very important lessons that really need to be shared with the entire scientific community. Even though disciplines such as bioinformatics develop very specialized tools, many of the lessons that they learn in developing those tools and much of that knowledge is really directly applicable to other disciplines.

As we develop tools, such as data mining, which really exploit databases, sharing this knowledge is going to become more and more important, not only on a discipline basis, but also on an international basis. This is why we think CODATA can play a very important role in helping groups, such as those represented here, who have formal responsibility for national data activities, to cooperate with other scientific disciplines and combine to share the knowledge that is developed to gain insight on knowledge that other people have developed to serve your particular agenda.

I mentioned that CODATA has a number of task groups that have been going on for some time in the biological sciences, in physics, and in materials science and then some that deal more with getting people to work together. We have also started other very important activities, which are aimed at data quality, data evaluation, and the validation of tools that are used for data collection, data evaluation, and data mining.

Some CODATA task groups attempt to define specific formats for reporting data and sharing data on a very large discipline-wide area, while some other groups are addressing specific problems. For example, there is a regional task group focusing on Scandinavia and Northern Russia, making sure that environmental data related to the disposal of radioactive material in ocean waters in that area are of high quality so that people can figure out what kinds of problems they have and also the severity of those problems.

I put this slide up here just to give you a flavor of the countries that are really involved in not just being in CODATA, but running CODATA. You can see they really span the major countries of the world that do research and development, and that do data work. It is through these people and their colleagues, who attend the meetings, that CODATA has the opportunity to make liaisons at both the highest level and at a working level that would allow groups within the federal government and the private sector here in the United States to advance their national data agendas into the international arena through these kinds of activities.

I would like to end by saying I feel a personal responsibility for getting many of you people here today away from your everyday schedule to talk about these events and, you know, I really appreciate Paul Uhlir and Goetz Oertel responding to my request to get this community together—people who worry about numeric scientific data, quantitative scientific data, to share with us what your agendas are.

We too have an agenda today. First on that agenda is to make you aware that CODATA is a vibrant, exciting organization, and that I, Lois Blaine, who is on the CODATA executive committee, and Paul and Goetz, who represent the U.S. activity here, are very much open to your suggestions for work that will help you in the way that advances your data activities. We are all equally available any time any of you would like to talk with me about possibilities, such as getting a small group of people together internationally. We have, I think, something to offer you.

Last, but not least, is that as you speak, even though there is a rapporteur here, I actually have a very, very good memory. I hear interesting ideas. I will file them away and you can be sure I will get in touch with you because I think this is really probably the most exciting time in the data arena. I think that we have wonderful opportunities to advance our data agenda, especially on the international level, and I would be more than happy to work with you and the rest of the U.S. National Committee, to achieve some of the goals. With that, I will end and take questions, but I would rather keep questions to me rather limited, so that we have time to go through everybody today.

DR. OERTEL: Thank you, John.

DR. GAINES: If I may raise one question, you stressed numeric data.

DR. RUMBLE: Quantitative, I think, is a better word.

DR. GAINES: Okay. I was going to say if that includes zeros and ones, then I think we are all happy.

DR. RUMBLE: –and modern database management is how you differentiate between a zero and null and not applicable. So, we have a lot of good discussions about that.

DR. VAITUKAITIS: When one sets standards, how does one then promote the adoption of those standards?

DR. RUMBLE: Briefly, these standards are an economic activity and there is almost no way you can make people do that. In industry, the way they do that is groups like ASTM develop a standard and companies say that when we work together we will reference this standard and have to do things that way. Scientists don't have that kind of economic activity. However, agencies have projects where people are collecting data and going to submit them to a common repository. The contracts and grants for these projects can call out those standards.

Second, if you do your standards work well and it is clear, understandable, robust, and meaningful, the community often will band together and use it voluntarily. There are examples of this in crystallography, surface science, and some of the bioinformatics. Medical researchers are some of the poorest advocates of standards.

DR. OERTEL: Thank you very much.

I would just add to that a couple of words on the activities of the U.S. National Committee. When I first became involved with the committee, the emphasis was on studies, including Finding the Forest in the Trees (NAP, 1995) and Bits of Power (NAP, 1997). Little did we know at the time how important these studies would be. When major policy developments came about in the intellectual property rights arena, and more recently in the legislated Freedom of Information Act extension to allow for access to research data in an unprecedented way, we had the staff here prepared with consensus positions, reports that had gone through the NRC system and that had the NRC's blessing. These reports turned out to be very powerful in presenting the position of the Academy and of others in this country against what might have been and still could be major intrusions in our ability to do our jobs in government and elsewhere in the United States.

So, I would say that without us necessarily knowing where we were headed, we were well prepared for it. One of the real challenges that I see is to know where the challenges of the future are going to be, what is going to bite us next. We certainly didn't anticipate the recent FOIA data access provision. The legislation came out of the blue, and the committee discussed this topic yesterday and we will continue to think about that issue. This is another area in which you can make important contributions today and tell us what, perhaps, you see coming around the bend next. So, without further ado, I would like to call first on Peter Weiss from OMB.

DISCUSSION WITH AGENCY DATA MANAGERS

Peter Weiss, Office of Management and Budget

MR. WEISS: I am going to be brief, if that is humanly possible for me. First, a disclaimer. I am going to be blunt and brutal here this morning and the views expressed are my own and do not necessarily reflect the Administration or OMB.

There is a single major issue and a single major challenge that needs to be confronted. A lot of people have been concerned about the imperfections in carrying out United States domestic data policy in the real world, or about the scientific issues around database protection and issues around data under grants, but all of these are merely subcategories of something much bigger. The biggest mistake you all can make is to presume that the policy of open and unrestricted access to scientific and technical data as enshrined in both statutes and policy in the United States is number one, secure and, number two, not potentially severely undermined by things that are going on around the world.

Complacency on this point is your enemy. You should have, I hope, my article that appeared in Borders in Cyberspace,3 talking about the growing trend, particularly among our European friends, of what I call "government commercialization." This is the trend to attempt to create government data monopolies for the stated purpose of revenue generation, but with the clear secondary purpose of bureaucratic self-aggrandizement. We see it most specifically in the European Union (EU) weather services, and I use the World Meteorological Organization (WMO) Resolution 40 as a case study of this phenomenon. However, we are seeing it more and more in other areas. The Ordnance Survey in Great Britain is an excellent early example of the concept of governmental authorities telling scientific and technical data agencies to get their butts off budget and attempt to aggressively use government copyright to recoup costs. The bureaucrats love this, because it allows them to make believe they are entrepreneurs. The concept of the "entrepreneurial bureaucrat," I suggest, is an oxymoron.

This trend is being further encouraged by the European Union’s Database Directive. Again, I will try to show you how these are connected. The European Database Directive specifically contemplates government agencies protecting otherwise non-copyrightable data. Whatever you think of the merits or lack thereof of database protection legislation in the United States with regard to privately developed information, it is critically important that any database protection legislation in the United States, number one, not apply to data generated with taxpayers' dollars.

Number two, we need affirmatively to roll back these emerging policies and practices in the European Union, and we need to convince anyone we can both in the European Union and around the world that this is the major threat. You have in your handouts the DG-XIII so-called "Green Paper" on "Public Sector Information: A Key Resource for Europe".4 This is a wonderful document that says all kinds of great things. I accompanied Jack Kelly, head of the National Weather Service, to Geneva in February to meet with our European Union and WMO counterparts. Not only do they defend the specific policies that require them to be entrepreneurial bureaucrats and stick to their right to be entrepreneurial bureaucrats, but when the Green Paper was pointed out to them, it was absolutely fascinating because consistently they all said DG-XIII does not know what it is doing. They said that DG-XIII is not really a player in this area, and that they (not DG-XIII) represent the policy of their governments. This is just a green paper, they say. They basically said that it is not going to go anywhere and that they have every intention of ignoring it. Were I not so naive, I wouldn't have been surprised. But being naive, I was surprised. I have been unable to determine from a purely political perspective what the future of this document is. And that is key, absolutely key.

The last point that I want to make is that I am not a scientist. I am not a data person. I am just a country lawyer, but I have been hanging around Washington now for over 20 years and I am very experienced in congressional and other political issues. I have become over the last couple of years shocked and chagrined by the fact, as I perceive it, that both domestically and internationally the scientific community has been remarkably ineffective politically. Domestically, I don't know why, but the scientific, research and educational communities have not been able to gain traction legislatively on the database protection issue. The scientific community, such as the National Academies and other groups, has written wonderful letters, and stuff like that. But the political traction just doesn't seem to be there. It is very regrettable. I don't know why that is.

As for this whole issue of data access under grants, the so-called "FOIA issue," I have to tell you that if you really thought that it came out of the blue, you all have been asleep. This issue has been kicking around in Congress for over a decade. It was inevitable that at some time an agency would engage in a data policy practice, which became perceived, rightly or wrongly, as obnoxious. As soon as that occurred, this type of a reaction was inevitable. As a matter of fact, I am told by a friend of mine, who is a FOIA expert, that this issue was explicitly raised almost 20 years ago before the American Bar Association. However, because at that point in time there wasn't a squeaky wheel, nothing became of it. There are equities on both sides that need to be balanced. Regrettably, what happened in the 1999 Omnibus Appropriations Bill was that some extremely poor legislative drafting occurred.

Again, I note that grantee folks tend to look to their grant agencies to be their protectors. A good buddy of mine is a grant manager at NIH and they have been inundated by people complaining that they "do something" about this so-called FOIA issue. I can tell you that the people on the Hill are not interested in anything that any of the grant agencies have to say.

So, the bottom line is two-fold. Number one, we have a major challenge. It is not a given that open and unrestricted data is the policy of the future. It will only be the policy of the future if it is successfully fought for. Secondly, both domestically and internationally, the scientific community has to get political traction. I don't know how you do that, but you don't have it now. And I don't know what role CODATA or ICSU can or should play in that. I leave it to you all.

DR. OERTEL: Thank you.

DR. RUMBLE: One quick comment is that ICSU and CODATA have a Data Access Commission, and the chair of that and I were talking about the fact that the EU is going to be reviewing the database protection law. When I am over there next week, we are going to arrange a time to have a meeting in Europe to get the European community at least to raise their voice somewhat in advance this time because last time they were caught totally off guard.

MR. WEISS: My free advice would be this. When we were meeting with the European weather people and they were disparaging the "Green Paper," I said, "if we were to approach your respective commerce, science, and environment ministers and ask them if they shared your view, what would they say?" Their answer was of course they would support us meteorological folks, but they got nervous when I asked them that. That should be a hint as to where you might want to go politically.

DR. OERTEL: You mentioned that the process was flawed, particularly how the FOIA issue got through the system the last time. Is there any way that you all might be able to police that in the future?

MR. WEISS: When in the dark of night, dozens of, for lack of better words, "silly" provisions are attached to major appropriations legislation that make a difference between keeping the government running or not, no president is going to veto those bills. Now, the Congress of the United States in both the House and the Senate have a rule that, translated into English, says something like: "thou shalt not attach positive law legislation into appropriations measures." They do not follow that rule. Talk to them about it. And I have got to tell you again on this political issue, they don't want to hear from your favorite grant agency. They don't want to hear from bureaucrats either. So, the scientific community is well advised not to rely on intermediation. You must make your case politically directly because we intermediators, for whatever reason at this point in time, are not successful.

DR. OERTEL: Okay. Well, that is very clear and I appreciate that. I am not quite sure how we are going to do it, but we will have to. Thank you very much. That was extremely useful. I would like to next go to Alan Gaines from the National Science Foundation.

Alan Gaines, National Science Foundation

DR. GAINES: I have prepared some viewgraphs to guide this discussion (see Appendix C). One aspect of this that I should put in perhaps as a caveat is reflected by my new title, which is Senior Associate for Spatial Data and Information. I am focusing specifically on spatial information. That generally means a geo-spatial reference. This partially reflects my earlier question to John Rumble, because a lot of our data are not numeric in the traditional sense, but, in fact, deal with imagery and so on. I know a number of you around the table from other groups also share this interest. So, perhaps, one plea I can make is that the realm of spatial data and information is, I think, a special class of information more generally, and it does have some unique issues of its own, as well as sharing all the issues of data, more generally speaking. However, I am not going to dwell on that in terms of what I want to talk about today. I will, in fact, speak very generally.

There are four data management issues that I would like to deal with briefly in turn. The first issue is multiple use, which, I think, will be common to most of us; the fact that data may have value beyond the purpose for which they were collected. This is one of the really exciting advances that has come up, particularly with the increased and advanced communication capabilities that we now have, such as the Web and so on. In fact, it has enabled the creation of new knowledge from those data when they may be used in a different context from that in which they were generated. This is really the heart, I think, of the interdisciplinary science and research and I think is one of the most exciting aspects of what is happening in the information revolution. I would state that in order for this kind of exchange or multiple use of data to be successful, the generation of metadata is absolutely essential. And I know that several of you will agree with this because it is sort of a mantra. But the metadata are required not just for discovery of the data by someone who isn't part of the information community, but, in fact, also for some initial evaluation as to their usability or value in this new domain.

So, dual purpose, but metadata are a real key and I would say that there is even a requirement for the metadata to be extensible because when someone else uses those data, they may be able to provide an evaluation of the utility of those data in a different context. That evaluation should, in fact, be fed back into the metadata so someone else doesn't have to repeat the process. They can have some verification or validation of that particular use. This is going to be a kind of a recurring theme, I guess, because I think it is extraordinarily important.
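[Editorial illustration] The idea of extensible metadata, where a later user's evaluation is fed back into the record, can be sketched roughly as follows. This is only a minimal sketch and not something presented at the meeting; the class names `MetadataRecord` and `UseEvaluation` and all field names are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UseEvaluation:
    """One user's assessment of a data set in a given context (hypothetical structure)."""
    domain: str    # the discipline or context in which the data were reused
    usable: bool   # whether the data proved fit for that use
    note: str      # free-text explanation for later users


@dataclass
class MetadataRecord:
    """Metadata that supports discovery and accumulates evaluations over time."""
    title: str
    collected_for: str  # the purpose for which the data were originally generated
    evaluations: List[UseEvaluation] = field(default_factory=list)

    def add_evaluation(self, evaluation: UseEvaluation) -> None:
        # Feed a new evaluation back into the metadata so others need not repeat it.
        self.evaluations.append(evaluation)

    def evaluations_for(self, domain: str) -> List[UseEvaluation]:
        # Look up what earlier users concluded about a particular kind of reuse.
        return [e for e in self.evaluations if e.domain == domain]


# Example: data collected for one purpose are evaluated for a different one.
record = MetadataRecord(title="Coastal imagery", collected_for="navigation charts")
record.add_evaluation(
    UseEvaluation(domain="ecology", usable=True, note="Resolution adequate for habitat mapping")
)
```

A later user interested in the same kind of reuse could call `record.evaluations_for("ecology")` before deciding whether to repeat the assessment.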

Another issue is interoperability at the data level, where in order for the data to be useful, not just available, to communities other than those for which they were generated, they have to be interoperable and probably the first Holy Grail there is semantic interoperability. I paraphrase that by the question, What do you mean by semantics?, because even that term means different things to different people or different things to the same person in different contexts. We have semantics at the machine language level. We have semantics at the data level. We have semantics at the application level and at the human level or the disciplinary implementation. And it is not just a matter of new terminology. The bigger problem, in fact, is a word that is used broadly, but with different meanings. So, when somebody hears it, they understand it, but not necessarily in the way it was used. So, this is a real problem that needs work.

The key, of course, is standards for a model that enjoys a wide acceptance in use. I would say that these things are necessary for the interdisciplinary sharing, but my second point is that they should really be minimal. The concept of "chaordic," as introduced by Dee Hock, simply means enough structure or order to make the system work, but still allowing a certain degree of chaos or individuality in doing it. Possible solutions here will be things like development of XML as a language and UML as a modeling language. These may be getting us there.

The next management issue is data quality. Again, this is a relative term, not absolute. Quality of the data in one use may be very different from that in another use. Data that are high quality for the use for which they were generated may not be as good for some other use. Again, this is the multiple use concept, which introduces new parameters.

The factors to deal with are uncertainty in the measurement, but also the total consistency, the uncertainty in scale (in fact, when combining data from different sources, they may be at different scales), and also, of course, the source. We desperately need to devise some way of representing the uncertainty in the data. Metadata carry that physically, but a lot of users aren't necessarily going to go to the metadata or they may not be able intuitively to understand them. So, another one of the Holy Grails is to derive some visual or intuitive way of representing the uncertainty in data.

The fourth issue is archiving. This is something that we, obviously, all deal with. An archive is a single source. The traditional archiving function is to make sure that the data are refreshed so the internal consistency is good. But they should also be extensible. I mentioned that before, both in data and the metadata, but also in format and medium. If they are to be persistent, we need to be able to accommodate new media as they come along. Similarly, storage gives you the format and medium and I think we need to start dealing with issues of disposal. As our data get clogged, we are going to have to start feeding some of them out of the bag.

I am running short on time, so let me go quickly through three policy issues. I would like to introduce what I call protective sharing and this, perhaps, addresses the FOIA issue. I think that it is necessary to give some limited exclusive use of data to the people who are generating them. In science, this is one of the things that we run up against as a principal barrier to full and open sharing, and that is that people really want to be first to publish. And I think that most of us would agree that that is a reasonable expectation. So, some period of exclusive use should be allowed, then the data should be shared. There are other questions of costs of archiving and management of the data. In fact, I would make a real distinction. We often talk about free and open access to data. I would say that is distinct from full and open and that sharing can be done for a fee, as well as for free.

Questions of restricted access, there are some data that need to be shared but not fully for a variety of reasons. Following on that, a policy issue that is often not thought about is that of liability, particularly when the data are being used for decision support, decisions are made based on those data that may involve money, lives, property, et cetera. If the decisions are wrong, there will be a tendency for the user to blame the data source. So, I think liability in sharing data is an issue that needs to be addressed.

DR. OERTEL: Thank you very much, Alan. In the interest of time, I would like to move on. I hope, Alan, you can stay around in case there are questions in the general discussion later.

Our next speaker is Gerry Barton of NOAA.

Gerald Barton, National Oceanic and Atmospheric Administration

DR. BARTON: Good morning. I am Gerry Barton from NOAA, which is part of the Department of Commerce. I have two major issues and then some other things that I will cover this morning. A major problem is the growth of the archive and how we pay for it and how we manage it. For example, in terms of archive growth, early in 1990 NOAA started off at about 130 terabytes and has grown in 1999 to about 750 terabytes (see Appendix D). This happens in a number of different ways as some breakdowns of data sets. It is estimated that the total archive will grow to 20 petabytes in the 15 years from 1999 to 2014. These are in petabytes now, not terabytes.
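
[Editorial illustration, not part of the spoken remarks.] The archive figures quoted above imply a compound annual growth rate, which can be worked out as follows. This is a rough sketch: it assumes steady year-over-year growth, decimal units (1 petabyte = 1,000 terabytes), and the year spans 1990 to 1999 and 1999 to 2014 taken from the remarks. The function name is hypothetical.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by growing from start to end over years."""
    return (end / start) ** (1.0 / years) - 1.0


TB_PER_PB = 1000  # assumption: decimal units

# Figures quoted in the remarks: 130 TB (1990), 750 TB (1999), 20 PB (2014 estimate).
past = annual_growth_rate(130, 750, 1999 - 1990)
projected = annual_growth_rate(750, 20 * TB_PER_PB, 2014 - 1999)

print(f"1990-1999: {past:.1%} per year")
print(f"1999-2014: {projected:.1%} per year")
```

Under these assumptions both periods work out to growth on the order of 20 to 25 percent per year, which is the scale of increase that a level budget would have to absorb.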

This growth rate through 2014 occurs for two reasons. The first one is major remote sensing systems. GOES is the satellite that we have had forever and it grows a little bit in the later years. The Doppler radar system that you are all familiar with is also included.

NOAA is also responsible for archiving data from the Defense Meteorological Satellite in our Geophysical Data Center in Boulder, CO. The NASA Earth Observation System also is expected to add tremendously to the data rate. In addition, smaller remote sensing systems and the polar orbiting satellites are also considered.

So, we have this major growth in radar. The problem is how do we archive the data? How do we take care of them and where does the money come from? NOAA’s budget for the data systems has been level for a number of years, level in today's dollars, so that it is actually going down in the constant dollar sense. We are not funded in the data centers. The base funding does not cover the budgets of the data centers. So, there is a lot of soft money that they keep looking for. Some of these soft money sources are drying up. It is a very difficult problem and then we have the question of how we are going to pay for all of the data that are expected to come, as I just discussed. So, budget issues are just paramount.

The second major issue, as Peter mentioned, is property rights and the commercialization of data. Peter already brought up the weather system problem. The data transmitted over the Global Telecommunications System used to be about 8,000 stations when we were playing with it very actively and that was about ten years ago. I think it is down now under 7,000 stations and it may be more. That is a major problem because these data are used for the weather models in the forecast that you see every day. The more the countries hold back their data, the less accurate the forecasts will be. We should be going in the other direction. So, it is a major problem.

NOAA also has problems with restrictions on data distributions, which Alan also mentioned. These restrictions of distribution and related caveats are forcing the data centers to look at individual data sets and impeding the release of complete data sets. We have to clean that out and release that just on a restrictive basis. It is creating major problems.


One thing that NOAA is doing is data rescue (this isn't really a problem because we have some money for it, but if we lose the money it could be a problem). We have thousands and thousands of data sets of various kinds on magnetic tape, on microfilm, and on microfiche. I think we are pretty much done with the cards, but there are still some data in paper formats and we are getting those converted over to digital formats. We have been attacking that problem with several million dollars for the past several years and it is really paying off. I think approximately 88,000 reels of microfiche and microfilm have been converted over. So, data rescue is a major problem, but we are handling it and hope to be able to handle it better in the future.

Regarding data management for other things such as the spatial data, we work very strongly in NOAA, in the Department of Commerce, and in the Federal Geographic Data Committee (FGDC), but we have no money to really do very much. We have to take it out of our own budget. So, we are trying to get through the FGDC to get a coordinated budget for federal geographic data activities and we have a plea for 2001 to try to get some money for that.

Let's talk briefly about Landsat. NOAA will no longer be involved with Landsat. The spacecraft will be launched April 15th, and USGS will distribute the data. The data will cost about $600 for the first scene. After that, it is the cost of the media. This really attacks what was done in the USNC/CODATA’s report on Finding the Forest in the Trees. But it attacks a specific issue that was in there. Now there has been the commercialization of the Landsat data and the lack of research that resulted from it for a long time.

NOAA conducts a number of international activities. We now have a node of the Global Environmental Locator Observing System, which is an international project. We also work with the Committee on Earth Observation Satellites (CEOS) and other ongoing international activities. We also worked with GOIN, which was an activity between Japan and the United States. That project just ended, and it will be moving into CEOS. With regard to CEOS, Helen Wood is chairing the International Global Observing Strategy project that has a disaster Web page that shows how satellite data can be used for various disasters.

DR. OERTEL: Thank you. The next speaker is Hedy Rossmeissl, who is from the USGS, at the Department of the Interior.

Hedy Rossmeissl, U.S. Geological Survey

MS. ROSSMEISSL: I am going to talk about three issues that at the U.S. Geological Survey we think are very critical data management concerns for us and some of them have really already been articulated.

First, data archiving and preservation; second, data integration; and, third, access to data. For data archiving and preservation, I can parallel a lot of the things that Gerry just said about scientific data and the amount of data. At the USGS, we have scientific responsibilities for data in geology, hydrology, biology, and geography. Many of these data sets are national in scope and we see very large responsibilities growing in this area. As Gerry just mentioned, for Landsat 7, the U.S. Geological Survey is taking a much wider role in Landsat now than we had previously for actually managing the ground stations and also for the preservation of all of those data. So, we also had a lot of old data previously, but I think we are moving up into the petabyte level of data collection in this satellite arena as well. So, we were very fortunate last year and again for 2000 to have received some additional funds from OMB for some additional archiving. We have been working on that issue with OMB and we are hopeful that we can continue to add some additional funds to our budget for these responsibilities because, again, we see them as being absolutely critical for the scientific community to have these data sets. Long-term preservation is imperative and we are seeing for the older Landsat data, a number of sales of those data sets. So, they are fairly popular in the community for looking at long-term studies.

USGS is also working with some of the commercial satellite companies from the perspective of long-term preservation. As the data get older, they are less economically viable for those satellite companies, plus it is a big burden also for them to be holding a lot of data for a long time. So, we are talking with companies, like SPOT Image and some of the newer commercial companies that are putting up satellites, to try to make some agreement about, again, the longer-term preservation of those data. That will continue to be an issue for us as those data holdings continue to grow.

Another big issue related to the archiving actually concerns access, because we are finding that in the past it was enough to say that the data were archived. Now, with the Internet and other advances, people expect you to make it available to them very rapidly. That puts a different flavor on your archive. It is not just a matter of having the data and having them preserved, but now there is a whole different set of requirements there to have your archive more easily accessible. Again, with the volumes of data we are talking about, that is an extreme challenge for USGS to meet.

Another aspect of the archiving relates to real-time data on the Internet. We are presenting many data sets now, many of which are online and you get different readings every 20 minutes and that sort of thing. The preservation of those kinds of data, when you take certain time spans, and how you actually archive those types of data is, again, a challenge that we haven't had before because we hadn't been presenting data that quickly. So, those are actually some of the biggest issues that are puzzling us now.

Data integration is certainly another very important consideration. Again, from the USGS perspective, we have such a wealth of earth science information and now the biological resources component has been added to the Survey. So, we are working much harder within our organization to look at that integrated science approach in not just our data activities, but in our science and how we are approaching different problems. We have ecosystem studies and other studies that we are working with other federal, state, and local agencies. Regarding the integration of the data—some of these issues have already been talked about related to the content of the data formats, the accuracy, and how you look at putting those data together across a wide spanning set of partners. It is a big challenge.

As Gerry Barton mentioned, the documentation of legacy data sets is a major issue. USGS has our data in the shoeboxes and we are trying to recover and make them more widely available and also then to put those in a data integration perspective.

With regard to copyright and data restriction issues, one issue is the use of data from state and particularly local constituencies, which are looking at gaining resources and money from their data. Working with those agencies to get information that we can put in our national databases is a challenge.

Also, we have been trying to work more with the private sector in areas like transportation data, for example, where you have many, many companies now that are working in that area and they have other issues. We don't have the resources to have that kind of detailed data. So, we are working with them and trying to convince them that maybe lowering the accuracy of some of those data and getting them into the public domain is useful.

I already mentioned a couple of things about data access, particularly the expectations of the customer community now that they want data accessible on a much more timely basis than we have had to deal with before. As a government agency it is hard to retool all of your systems on an adequate basis to meet that need. We have been working on offering those data through the Internet, creating a multiple set of tabs for people to actually find the data easier.

Another issue that has been brought up before is the fees for access versus access without a charge. Our scientists really want to see the data accessible without charge. Some of us who are working in the data area, however, feel that if we don't try to work some new algorithms for some fees for those data, we are not going to be able to keep the infrastructure in place to be able to offer the data. So, we are dealing with that issue—it is not just buying a product anymore, it is how do you get some of those data access mechanisms set up.

We are also finding as we are talking to the private sector that they are willing to offer some of these data without charge so that they can then add value to them and offer more enhanced products. Sometimes the federal agency is a little bit leery about some of those arrangements, but we are getting a lot more interest from the private sector in offering data for them to consider value adding.

I guess that sums it up.

DR. OERTEL: Thank you very much. I think we are getting great information from all the speakers.

MR. WEISS: Another species of cooperation with the private sector, other than the one you mentioned, is your TerraServer initiative. You might want to talk about it for a moment and explain what that is.

DR. ROSSMEISSL: USGS has a relationship with Microsoft. They approached the agency and were interested in offering our digital orthophoto data. Their interest was the fact that they wanted to build a very large database and show that their SQL Server software could handle very large databases on the Internet. We had a very large database. So, we came together. It has been an excellent relationship from the perspective that USGS has been working with the research side of Microsoft, not the commercial side. There hasn't been a big issue about them wanting to make a lot of money off this site. And for us, we have had a tremendous amount of visibility for those data, a lot more interest in them, much more widely known and used now.

So, again, we are finding that other companies are coming to us now and are willing to do some presenting of data, again, for more of the public good. I think that is a good turn of events that we are seeing. Microsoft is now interested in moving that TerraServer to their Encarta online activities. So, we will see how this evolves down the road.

DR. OERTEL: Breaking new ground. That is very interesting.

DR. RUMBLE: When federal data managers worry about some of these issues like archiving and interchange and things like that internationally, how do you do that? Do you just have a limited number of partners, like the European Space Agency and Japan Space Agency, that you deal with, or what mechanism do you use?

DR. BARTON: There are many, many different ones. It just depends on the situation. For example, one that I mentioned earlier was an agreement between our president and the prime minister of Japan. So, it was at that level.

DR. OERTEL: I would like to move on to Bob Shepanek from EPA.

Robert Shepanek, Environmental Protection Agency5

Overall EPA Direction

"EPA’s data resources represent one of the Agency’s greatest assets. As a national Federal source of reliable and comprehensive statistical information on the state of public health and the environmental, EPA is uniquely equipped to provide the public with critical tools to pursue responsible policies." Browner and Hensen, EPA Reorganization Memorandum, February 1997

 

Strategic Drivers

Practice of Environmental Science is Changing

Technology is Evolving

Challenge to ORD: Leverage technology to meet the needs posed by changes in the practice of environmental science!

 

Scientific Information Management Challenges

Technical Challenges

Management Challenges

Cultural Challenges

 

Cultural Challenges

Influence change through scientific societies and peer pressure – CODATA efforts on H.R. 354 and A-110

 

Management Challenges

Commitment of adequate resources for systems development, operation and population

Support for related policies and procedures

Appropriate incentives for involvement by staff and project participants

See USNC/CODATA report, Finding the Forest in the Trees (National Academy Press, 1995)

 

Technical Challenges

Improve access and documentation of data and information relevant to environmental scientists in other countries

 

EPA ORD Activities

Re-inventing ORD as an Organization Engaged in Science Information Management

Scientific Information Management Policies, Procedures and Standards

Outreach

Development of an Advanced Architecture

 

Additional Information

 

DR. OERTEL: Thank you, Bob. Talking about your outreach point, I am glad you are here to liven things up and it is very valuable. The next speaker is Wanda Ferrell from DOE.

 

Wanda Ferrell, Department of Energy

DR. FERRELL: I am in the DOE Office of Biological and Environmental Research Branch, where we do climate change and genome research (both human and microbial). The human genome work is done in cooperation with NIH.

Our largest global change program is the Atmospheric Radiation Measurement (ARM) Program. The objective of ARM is to improve the parameterization of the role of clouds in general circulation models (GCMs). The observing capabilities for ARM are located in three geographic regions: the U.S. Southern Great Plains, the Tropical Western Pacific, and the North Slope of Alaska. The Southern Great Plains site is the oldest and largest with over 200 instruments; we have named it a climate observatory. The North Slope of Alaska site is located in Barrow with a second one coming on line later in the year. The Tropical Western Pacific site comprises stations at Manus and Nauru with a third to be added next year.

In the ARM program, data transmission is a problem, since we are using a distributed data system. With the exception of the tropics, we have developed site data systems for processing. The site data are then sent to the experiment center, located at the Pacific Northwest National Laboratory, for some additional processing before being sent to the ARM Archive. The Archive provides access to the general science community.

DOE does reserve the right to charge if someone were to ask to empty the archives, but beyond that, data are available at no cost. The issues I will address here are intellectual property, resources, and metadata. Concerning intellectual property, I only exhort you to continue your good work in tracking this issue. This is an area that presents a lot of potential problems for the science community, and I think CODATA has been in the forefront in raising the flag for the agencies and the general community as well as addressing the problem. You have kept us apprised, been way out ahead of the rest of us. So, any help that we can give you, we will be happy to lend our support.

Resources for data are always an issue. Specifically, in the genome program in the next 18 months there is expected to be a huge explosion of information. I am sure NIH will speak to this as well. The big push to complete sequencing is going to create a huge bulge of data flowing into the system. We have to plan for not only storing these data, but also providing useful search tools for finding the needle in the haystack.

A general issue is how to package information so that it is useful to the primary scientific community, as well as to secondary users. We cannot always anticipate the audience for our data products. Thus, we have to respond to needs identified by the broader user community.

User expectations drive the resource requirements. As technology changes in the general community, people expect the data systems to respond accordingly. Similarly, success brings new requirements. For example, in ARM a few years ago we created special data products designed to address high priority scientific questions. As the usefulness of these early products was demonstrated, it created a demand for more. Staff are dedicated to creating new special products and to modifying existing products.

It is a common misperception that you build a data system and go away and leave it, that a data system requires no enhancements. But users expect a system to be responsive and to be updated to include new technology developments. The Web has been the basis of growing user expectations over the past few years. We now have to respond to these expectations and to respond in a flat budget environment.

Metadata is a very important issue for ARM, particularly for the area of data quality, which is our highest priority. In ARM, how to tag the data files is a major issue, since we currently produce over 2 gigabytes a day. As the sites grow and as we produce special data products, this number will grow dramatically. We are not like the satellite systems that have huge files. DOE has lots of small files. So, this produces a problem for us that is not usually addressed by data management committees.

Another issue is constantly developing new Web tools so that we can search the databases that we have.

My last issue is a personal one and not a DOE position, that is, we should have some means of publishing data sets or at least having some process by which data sets can be cited. Yesterday, the committee discussed how to track the use of data sets. Citations for data sets answer this problem. Publication of techniques also provides documentation for processing. Publication by people in data management helps with career advancement and recognition. We need incentives for technical people to work in data management. Our data centers employ a mix of computer scientists and physical scientists to produce needed data sets. But for physical scientists it is important to their careers to publish; thus, it's difficult to recruit without this incentive.

DR. OERTEL: Thank you, Wanda. If there are no questions for Wanda, we’ll move on to Richard DuBois from NIH.

 

Richard DuBois, National Institutes of Health

DR. DUBOIS: One nice thing about being one of the later speakers is that a lot of these issues have already been discussed.

I am with the National Center for Research Resources, which is a freestanding center within the NIH. We specifically are not data handlers, but we do support a great deal of research activity that deals with the handling, the generation, the storing, and the analysis of data. So, we feel we have something to say here.

Basically, I want to start with this slide that says "In recent years it has become evident that only through the integration and analysis of heterogeneous data will it be possible to truly understand and control disease." It wasn't until rather recent times that this was obvious, but it is obvious. Because of what I am going to say later, it will become obvious that this has been a major problem for us because of a number of databases. For example, there are the crystallographic databases at Brookhaven and Cambridge. The Brookhaven Crystallographic Database has over 10 gigabytes. However, where we really get into high level data is in the imaging databases. For example, we have a site at UCLA that currently has over 2.2 terabytes of storage. That is a lot of storage.

But one of the areas that we are involved with, along with DOE, is the GenBank database, which is over 13 gigabytes, of which the human genome component is 6.5 gigabytes. The rest, of course, covers various other organisms like yeast, Drosophila, which is the fly, C. elegans, which is the worm, and then mouse. Actually, the mouse gives new meaning to "are you a man or a mouse."

I want to show this slide which displays volume sizes by resolution for the brain. This is from the resource I mentioned before at UCLA. If you look at a typical brain, it is around 50 or a hundred cubic centimeters, and if you have a voxel size of 1 centimeter, which would be the resolution, you end up with a 4.5 kilobyte database. But as you move to a millimeter, it jumps up by a factor of 10, which you would expect. Well, when you get to the 10 micron area, it is 4.5 terabytes and then if you get to a 1 micron area, it is in the petabyte range. The research group has digital cameras measuring in the 3 to 5 micron region and we are assuming that at some point we are going to want to be looking at cells, and that is in the 2 to 4 micron area. So, as you can see, the amount of data that are going to be collected in these areas is immense.
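
[Editorial illustration, not part of the spoken remarks.] The jump from kilobytes to petabytes follows from cubic scaling: each tenfold refinement in linear voxel size multiplies the number of voxels, and hence the storage, by a thousand. The sketch below anchors on the 4.5 kilobyte figure quoted for 1 centimeter voxels and scales from there. It assumes decimal units and a fixed number of bytes per voxel; the function names are hypothetical.

```python
BYTES_AT_1_CM = 4_500  # figure quoted in the remarks for 1 cm voxels
UM_PER_CM = 10_000     # micrometers per centimeter


def volume_bytes(resolution_um: int) -> float:
    """Scale the quoted 1 cm size by the cube of the linear refinement factor."""
    linear_factor = UM_PER_CM / resolution_um
    return BYTES_AT_1_CM * linear_factor ** 3


def human_readable(n_bytes: float) -> str:
    """Format a byte count using decimal (powers of 1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    i = 0
    while n_bytes >= 1000 and i < len(units) - 1:
        n_bytes /= 1000
        i += 1
    return f"{n_bytes:.1f} {units[i]}"


# 1 cm, 1 mm, 100 um, 10 um, and 1 um voxels
for res in (10_000, 1_000, 100, 10, 1):
    print(f"{res:>6} um voxels: {human_readable(volume_bytes(res))}")
```

Scaled this way, the quoted endpoints are mutually consistent: 4.5 kilobytes at 1 centimeter corresponds to 4.5 terabytes at 10 microns and 4.5 petabytes at 1 micron.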

So, where are we going? Well, just looking at the genome, the next step once you get the genome is to start looking at the proteome, that is molecular function. What do these proteins do? The ideal access infrastructure has to be put together as sort of a middleware development. Not only do you have to develop this middleware, but it has to be transparent to the user because most biologists don't want to be computer scientists.

Finally, there is the development of data mining, which we have heard about before, integration and modeling situations. Data mining is a fairly newly coined term. As I understand it and we deal with it, data mining has to do with looking at data with the goal of developing new relationships, but usually without any particular target at the outset. So, you can imagine how difficult that is going to be. I mean, the kind of software that is going to have to be developed to be able to do the data mining is not going to be easy to accomplish.

Going a little bit further, although people are constantly thinking about this already, is getting to the cellular and organ function and physiome. Now, this will require extremely complex models and simulations. And it is only going to be created from using heterogeneous data sources. There is just no other way they are going to be able even to begin doing this. Certainly, more sophisticated data handling and visualization tools will be needed, including the use of expert system applications; that is, so far, artificial intelligence. You are going to have to be able to get help from well-crafted software that provides this expert capability.

So, from our point of view, what are the management issues? Well, these are not all of them, but they are certainly the important ones. We are very interested in standards. It is very important that databases in the future, even today, be able to talk to one another. That is happening, but it is happening slowly and not very efficiently in many instances.

Training is a major issue. It turns out that the people who do all this work that I have been talking about are scarce and those that are good and even some that are not so good get hired up by the pharmaceutical industry very, very quickly. The other problem is that in the university environment, those kinds of people are sort of between a rock and a hard place. They are not really computer scientists and they are not really biologists. So, they don't really do very well in that environment. Those issues have to be addressed. Quickly, with regard to policy issues, we are concerned with the issue of accessibility and the whole area of intellectual property and that has been discussed somewhat. Then security is a major issue with us. With clinical data especially, security is vital. Even the researchers want their data to be secure.

MR. WEISS: Are you talking about data integrity, privacy, or both?

DR. DUBOIS: Both, especially with the clinical data.

What can CODATA do for us? Well, based on what I have heard this morning, we certainly need standards and to the extent that this group can help us get these standards, we would welcome that. There is no question about that.

Finally, accessibility has become quite a problem. We reviewed a resource recently and one of the problems they had was to look at sudden cardiac death syndrome. They want to look at these data. One of the good databases for that happens to be in Italy. The Italians, for some reason or other have good databases. However, their data are not free. This is the sort of thing we are worried about.

MR. CHINMAN: When you speak of needing standards, I assume that within your immediate community that you have a standardized way of labeling things, but the standards that you are talking about have to do with computer standards or format standards. Your community needs to get together and define its own set of standards and then there is --

DR. DUBOIS: On some level that is true.

MR. CHINMAN: Then there is the CODATA involvement that could help --

DR. DUBOIS: There could be -- yes -- sort of a push like what Dr. Vaitukaitis mentioned earlier, you know, the medical community is not that eager to jump at standards. Some people look at that as a threat in a sense.

DR. VAITUKAITIS: The other thing we need is appropriate laboratory tools for access to high-end technologies, but also laboratory tools for the Internet.

DR. GAINES: I would make one comment on the visualization of large data sets. Yesterday, NSF released a new program announcement dealing with exactly that topic and, of course, we always welcome partners in this. So, if any of the other agencies have interests in developing advanced techniques for visualization --

DR. GERSHON: I might add very quickly that it is too late, too little.

DR. DUBOIS: We are really aware of the problem and, in fact, we are in the process, hopefully, of establishing one, possibly two, centers in that area within the year.

DR. OERTEL: We’ll now hear from Elliot Siegel of NLM.

 

Elliot Siegel, National Library of Medicine

DR. SIEGEL: Peter and I just had a wonderful conversation. I think we have, I hope, come to a meeting of the minds that bureaucratic entrepreneurs are not oxymorons. Some of my best friends, including myself, are motivated by trying to work smart and making a difference.

MR. WEISS: With the exception of certain people in European weather services.

DR. SIEGEL: Let me tell you what I am not going to talk about. I am not going to talk about health data cards. I am not going to talk about genomic databases and I am not going to talk about toxicology or environmental health, all those wonderful things that the entrepreneurial bureaucrats have been working on over the years.

I will talk to you about something you probably don't know much about, and the issue was raised yesterday about the danger of there being a greater gap between the haves and the have nots in terms of accurate information. That is a project that NLM has been involved with now for almost two years on behalf of the NIH and it is a project that Dr. Harold Varmus, director of NIH, is very keenly concerned about on a personal level.

NLM is working on a Multilateral Initiative on Malaria. I have given you a handout that is an overview of what NLM is doing (see Appendix F). What NLM is attempting to do is to work with the fundamental objective of this multilateral initiative to enhance the capacity of African scientists to do research in Africa.

There was a conference held in Dakar, Senegal, in January 1997 that looked at the whole basic problem of ability to do research in Africa in general, not just on malaria. What was identified there was the opportunity to work with the malaria community, both in terms of research and control site because of the tremendous havoc that this reemerging disease was wreaking on the economy and on human health. Three million people are estimated to die each year in Africa alone from malaria. Most of these are women and children.

The scientists who attended this meeting were very quick to point out that one of the things that they feel a great lack of is the ability to communicate with other scientists in Africa and with colleagues around the world, particularly in Europe and in North America. NLM stepped forward after that, at the suggestion of Harold Varmus, to play a role in this. We have been very active in outreach domestically in rural areas in the United States for over ten years and we are certainly interested in international activities. So, we created a communications working group within the overall structure. I chair that group and we play a catalytic role. We play an advisory role and lately we have been playing a funding role, too, because, frankly, you have to put your money on the table if you would like to get additional monies. The fundraisers know that and sometimes if you want to get something done, you have to pay for it.

So, we advise, we coordinate site visits. We do technical consultations. We fund, as I say, appropriate communications equipment purchases, installation, and training. We are very concerned about library infrastructure development. Once you get access to information, you need an infrastructure that can handle knowledge management.

We are also interested in evaluating what we are doing as a model for capacity building in telematics in developing regions of the world. So, I am speaking not just to CODATA's interest, but to the NRC that also has a great concern about the situation in Africa and other developing regions, and specifically involving the use of telecommunications technology as a means to improve the situation.

The bottom line is how to improve communications between scientists. We want to improve access to needed scientific information, from libraries, electronic as well as printed media, more databases and all the wonderful things that are on the Internet.

We have developed a plan for doing this. We didn't say we are going to solve all of Africa's problems. We can't do that. We try to do one problem at a time, one location at a time, one site at a time. The strategy we have undertaken is to work with malaria research sites where you can expect that there will be an opportunity for sustainability once the work there has begun. We are looking for locations where there are what we call champions on the ground, people who you can work with. You can't orchestrate this from Bethesda and you can't do this simply with consultants. We need to have people locally who are committed. So, we seek out those people and those sites. We go there and visit. We go through the muck and we find what needs to be done and we try to get it done.

We found that things have changed dramatically over the past two years in terms of Internet connectivity. Most countries in Africa with the exception of one, Eritrea, have an Internet gateway, providing full Internet service. The problem is it sits in the capital city and it doesn't necessarily go to the places you want it to go and more often than not, it doesn't go to the malaria research sites.

We are in the business of providing that last mile connection to the malaria research site. And this can be done by microwave link. It can be done by cable. It can be done by satellite. When you are dealing with satellites, you get into other sorts of problems, which I will address in a moment. But what we are basically dealing with is that where connectivity does exist, it is generally unreliable. It costs too much and it doesn't provide you with the kinds of things you need to do if you are going to have access to the modern data resources we are all familiar with.

We want to provide access, of course, to MEDLINE and other services that are available, such as BIOSIS, GenBank, or other genomic databases. We have been working with Lois Blaine most recently on a malaria research repository, which she will probably tell you about.

We have been working with folks who have been doing geographic mapping that track population movements and the movements of mosquitoes and they use NASA satellites for this. So, there are a lot of people involved and there are a lot of organizations involved. That is kind of what I wanted to emphasize because this is very much a partnership, a multilateral effort, that involves a lot of different organizations.

Working with these organizations, and the funding organizations in particular, is very important because of something I didn't mention that is critical to the strategy we have adopted: we want to inculcate within the research community the need to fund communications. This is not a problem for NIH; we do this. However, it is a problem for organizations such as the Institut Pasteur and Wellcome Trust; these groups ought to fund access to communication resources as a cost of doing research. It is part of the research enterprise. That is sometimes a hard sell, but if you can't get that done, you are not going to get sustainability. Part of what I find myself doing is not just worrying about satellite connections and how many kilobytes you can transfer, but actually working with these folks to convince them that this is something they need to pay attention to and they need to fund.

The partners, as I have mentioned, have been the African scientists and the governmental agencies. They have been the donor agencies, including my colleagues at NIH, Pasteur, Walter Reed, Wellcome Trust, World Bank, and USAID. I have also brought to the table colleagues who are involved in the international medical library community in Africa, and also the telecommunications community.

I have here a list of the places where we have been and the things we are doing. I am not going to run through the details of this. You might look at this and say NLM is acting as the phone company. We are not simply plugging lines into walls, but you do have to deal with the technology and you have to deal with the politics. We have learned that in some parts of the world, in Kenya, for example, we are all set to go. We have done some wonderful work on the ground. We need to get a piece of paper signed that says we can install a VSAT (very small aperture terminal) ground station. We have got verbal approval. I don't have it in writing yet. Until I get it in writing, I can't pay for it. I will give you some dollar numbers so you know the order of magnitude I am talking about. It costs basically $25,000 for a ground station, about $25,000 a year to operate, but compare this to the fact that the phone bill in many of these places is $30,000 a year. So, it doesn’t cost that much more. We hope that there will be other technologies that will come along. We have heard about Iridium and Teledesic. Right now, they are geared primarily to well-heeled business people and governments. That technology is not there for the kinds of folks we are talking about working with. We hope that will change.

We hope that the NRC can exercise some persuasive power. We are interested in becoming a part of that effort. In the meantime, we have to deal with what we have available now. We are working in Kenya, as I mentioned. We have done some great things in Mali, which has been a recipient of a lot of NIH largesse over the years. We are ready to go in Tanzania in terms of work there. We need a funding partner that can pick up the costs of their research and their communications. We are working on building up library sites.

I realize I am running out of time. The concept of DOCLINE libraries. That is probably not a term that is terribly familiar to you but we want to encourage inter-library loan and we want to build up libraries in Africa. So, we are working with the South Africans. We are going to be working with a pretty good library in Harare, Zimbabwe, and we are trying to build up the medical library in Mali, so that they can serve western Africa, Zimbabwe can serve central and eastern, and South Africa can serve southern Africa. We envision NLM serving as a backup, but basically the Africans will be able to help themselves.

We just got back from a meeting in Durban, South Africa in March. Harold Varmus and Don Lindberg were there. I was there. We had a nice contingent. Nine hundred people were there. And we have made some wonderful contacts in terms of new places to work, in Ghana and Nigeria. We were finalizing plans with the Medical Research Council in South Africa. The British Medical Association was on the phone with me two days ago. They want to help. They want to provide documents.

The commercial publishers are not very eager to help and that might be something that you can address. So, what I have laid out for you is basically an open invitation. This is what we are doing. You are aware of the data needs. There is a need for the data that we have. The question is how to get them there, how to deal with technical and political impediments, and how can we get enough people working together so that this can really make a difference.

DR. OERTEL: Thank you very much. I would like to move right along to Pamela Andre from the National Agricultural Library.

 

Pamela Andre, National Agricultural Library

DR. ANDRE: I would like to begin by sending regrets from my colleague, Dr. Judy St. John. Dr. St. John is responsible for overseeing plant research at USDA Agricultural Research Service. Plant genetics is a really big part of that activity and she was looking forward to talking to you a little bit about what is going on. Unfortunately, she couldn't. I just want to say that she is very interested in continuing involvement with this group. I have collected materials here. I will make sure that she gets copies of all of these.

I will move on and talk about the library and information as distinct from the data side. I want to talk specifically about an activity going on at the Department of Agriculture related to the preservation of USDA digital publications. As I have listened to all of you around the table, it is very clear to me that USDA is at the very beginning of many of the activities that you are very deeply involved in. And as I have heard you talk about the various issues from technology to copyright to user access, those are the very things that we are beginning to deal with. And I want to say, by way of warning, that I have already made notes and I would like to invite a number of you folks to come and talk with us as we move through this process at USDA.

You may be wondering what a publication has to do with data and electronic data resources. I would suggest that at USDA we are at the very beginning of dealing with agency policies and procedures relating to both electronic publications and electronic databases. In taking this approach relating to digital publications, the expectation is that we have opened a very small cover on a problem that we believe is, in fact, manageable in that the policies and procedures coming out of this activity will, in fact, then be broadened to include the broader database community within the department. So, there is, in fact, a connection, although it may be some time before we get there.

This whole initiative really began about two years ago with a conference that we pulled together here in Washington, relating to the preservation of USDA's digital publications. That conference was very much driven by users of those publications, everything from the agricultural research kinds of publications, to economic statistics and all of the agricultural statistics that really drive a whole lot of the agribusiness community in this country. People were concerned, as more and more USDA publications were coming out in electronic form, about what was happening to the old versions of those materials.

At this conference, we talked about the wide range of issues that many of you have identified here, everything from what exactly is there at USDA that is published in electronic form, the inventory kinds of issues, through what are the technologies that we need, what are the archival procedures that will ensure not just that this material is somewhere in an archive, but that it is in fact accessible. You have heard a number of folks talk about those issues around the table. The result of that conference was, in fact, a framework document that I have here and I would be happy to share with any of you. It is also available on the Web. It is entitled "Preservation of and Permanent Public Access to USDA Digital Publications." I have to acknowledge Paul Uhlir as our consultant on this activity.

That activity took place in the spring and summer of 1997. As you may be aware, USDA is a very large department. There are over a hundred thousand employees and the idea of getting such a framework document approved within the department was quite a challenge, as you might imagine. Luckily, we have a chief information officer (CIO) at USDA who is very well aware of what the department is lacking with regard to managing this electronic data and information. She took the initiative to approve this framework document and to establish a steering committee to move forward to develop the policies and procedures needed within the department to ensure long-term access to these publications.

I was given the responsibility of chairing this committee and we now have in place a committee of about a dozen folks, representing agencies within the Department of Agriculture, who want to publish materials based on their various activities. We have a number of stakeholders from the research university community, from the agribusiness community, as well as some of our key federal colleagues, such as the Government Printing Office and the archives. We have come together to begin to grapple with how we move this forward in a department that is as large and as diverse as the Department of Agriculture. Basically we are focusing on three things that, again, were part of the outline in the framework document driven by the user community. The first is the inventory, as I mentioned. What are the things that are available in the Department of Agriculture that need to be retained long term and what is the life cycle process for managing those things?

I will tell you that in some of our preliminary discussions with key agencies, they are very concerned about the version of a publication that is available today and then they are concerned about the version that is going to be available next week or next year. They are totally unconcerned with the version that was available last year. I mean, there does not seem to be an understanding of the importance of the continuum, the historical information related to progress in American agriculture. So, I have to say that one of our first activities is really an educational one. Following that, as many of you have noted, are the technical requirements: what does it really take to archive and continue to provide access to these resources once we have identified what they are?

Again, I have heard a number of you around the table talk about large-scale data sets and how you are managing them. Just be warned that I am going to call on you to come and talk to us about that because I think it is very important that folks at USDA hear from those of you in the federal community who have begun and have a continuing commitment to these kinds of activities.

The last item that we are really dealing with relates to user access. What good does it do to preserve and archive material, if you can't give the folks that need it access to it?

Those are basically the key activities that we have underway. Included in this, as you might expect, is the whole issue of metadata, how do you describe these resources to facilitate access? I think it is going to be a very long haul. As I said, education is a key part of this activity, making sure that people in the department understand and acknowledge the importance of it. I should note that at this point there is no budget allocated for this activity. So, what we have is a group of very committed volunteers who think it is important enough that they will basically take on the work, at least the preliminary inventory work to try to move it forward. So, we are just beginning, folks, and I am delighted to hear that so many of you around the table are moving ahead so quickly and we hope to take advantage of that. Thank you.

DR. OERTEL: Thank you very much. I am sure we will individually, as well as collectively, help where we can.

DR. RUMBLE: Pam, where in the USDA are the people who worry about nutrition databases, say the nutrition composition of foods, especially on the varietal level, not just rice, but the 50 different rices and things like that? Does the National Agricultural Library have something to do with that?

DR. ANDRE: Not really. There is a Food and Nutrition Service that has that as part of their mandate. Of course, the Agricultural Research Service is doing research in the area of nutrition. They are, in fact, developing those kinds of data sets. But as for issues relating, as I said, to the long-term archiving and the ongoing management of those data sets, there is no clear policy within the Department of Agriculture related to that. Whether it is research that is going on in forestry, in soil conservation, in nutrition, in plant genetics, and all the rest of it, there is no departmental policy relating to how that electronic information needs to be managed.

DR. RUMBLE: What about international cooperation with respect to the databases, are you aware of a central point? Does the CIO of USDA worry about that?

DR. ANDRE: No, the CIO doesn't worry about that. Being a relatively new position in the Department, the office really worries about its own stability at this point.

It sounds like that is not a new concept. But I have to say that USDA, as a Department, is a very diverse group of agencies. There are 29 agencies within the Department and traditionally they have all been extraordinarily independent. Within the Department of Agriculture there are no standards relating to like kinds of research. So, there are very diverse, very independent agencies who have difficulty thinking that there are others who are doing comparable things. Perhaps there should be stronger collaboration.

DR. OERTEL: Thank you very much. I am curious about all kinds of forms of bureaucratic entrepreneurship. The next speaker is Joe Bredekamp, from NASA.

 

Joseph Bredekamp and Lola Olsen, National Aeronautics and Space Administration

DR. BREDEKAMP: I am from the Office of Space Science and, to just give you some context, we have a very simple mission statement, which is to "Solve Mysteries of the Universe, Explore the Solar System, Discover Planets Around Other Stars, Search for Life Beyond Earth." Yes, it is a rather awesome statement of a mission, but actually it is terribly exciting and terribly fun and right up front, I might say one of the things we are committed to in Space Science is sharing the excitement of that scientific endeavor, not only amongst the research community, but with the public at large. I think the best examples of that are where we really want the public to participate in our missions. Certainly, the Mars Pathfinder Mission is the best example of that. Those images were made available around the world and a few of them actually before the Principal Investigator (PI) saw them. He did take a couple of hours’ rest. That is a benchmark that would be tough to duplicate, but that is what we hope to do with Saturn; literally have the public participate, not just view, but literally feel a part of it.

We are data rich, data intensive, and I will kind of give you a feel for that, not so much the data volumes, but just the operating missions (see Appendix G). This vu-graph is an eye test and the idea is not to go through each of these, but to give you some sense that it is a terribly exciting time because we have a very rich set of missions operating, under development, and to be operated, as we head into the new millennium. There is just a number and diversity of missions that we have and they range from great observatory missions to small PI-class missions that are literally operated from universities and private institutions. It is very international, running from flying instruments on foreign and international spacecraft to flying their instruments on our spacecraft to collaborating jointly and operating a wide variety of missions, small and large. It runs the gamut.

Data management. We really do feel strongly that the reason we