International and Domestic Data Issues Discussion:
Scientific and Technical Data Needs and Opportunities
April 9, 1999
TRANSCRIPT
INTRODUCTORY REMARKS
DR. OERTEL: Good morning and welcome to the U.S. National Committee for CODATA’s
International and Domestic Issues Discussion on Scientific and Technical Data
Needs and Opportunities. I am Goetz Oertel, chair of the U.S. National
Committee (USNC) for CODATA. I wanted to first of all welcome the people who
are here from the various federal scientific agencies including: Peter Weiss,
Office of Management and Budget (OMB); Alan Gaines, National Science Foundation
(NSF); Gerald Barton, National Oceanic and Atmospheric Administration (NOAA);
Hedy Rossmeissl, U.S. Geological Survey (USGS); Robert Shepanek, Environmental
Protection Agency (EPA); Wanda Ferrell, Department of Energy (DOE); Judy
Vaitukaitis and Richard Dubois, National Center for Research Resources at National
Institutes of Health (NIH); Elliot Siegel, National Library of Medicine (NLM)
at NIH; Pamela Andre, National Agricultural Library (NAL) at U.S. Department of
Agriculture (USDA); Joe Bredekamp and Lola Olsen, National Aeronautics and
Space Administration (NASA); Kurt Molholm, Defense Technical Information Center
(DTIC).1
Before we begin, let me call on Paul Uhlir, director of the USNC/CODATA, who will give
a thumbnail sketch of CODATA.
MR. UHLIR: Thank you all for coming. I think most of you are fairly familiar with
the work of the National Research Council and with this committee, but for
those of you who aren't, let me give you a quick background. This forum was
convened by the U.S. National Committee for the Committee on Data for Science
and Technology (CODATA). CODATA is an interdisciplinary committee of the
International Council for Science (ICSU), which is a non-governmental
organization. The U.S. National Committee for CODATA is the U.S. member body to
that group. There are 31 other scientific unions and interdisciplinary
committees affiliated with ICSU, and the National Research Council administers
all the U.S. committees that adhere to ICSU. Many of the U.S. national
committees actually are not the typical kind of NRC committees that conduct
studies and produce reports and do other kinds of substantive work. They really
serve more of a liaison function between the U.S. scientific community and the
international ICSU unions. They provide support for those unions and send delegates
to the international meetings. In the case of the USNC/CODATA and some of the
other interdisciplinary committees, such as the International
Geosphere-Biosphere Committee and other interdisciplinary environmental groups,
they actually do conduct studies or hold workshops and conferences under the
National Research Council auspices, in addition to the international liaison
kinds of activities.
I believe all of you have received a brief one-pager along with your letter of
invitation, which gave a summary of CODATA and the U.S. National Committee, and
listed some of the major activities that the U.S. National Committee has
undertaken since 1991, when I took over the direction of the committee. I am
not going to go over that material, which noted several studies and a national
conference on scientific and technical data exchange and integration. We are
now planning another data conference for March 2000 with our sponsors and we
also are planning to hold an international CODATA conference and general assembly
in 2002. In addition, we are planning for next year bilateral meetings with the
Chinese National Committee to CODATA, which would be held in conjunction with
our data conference next March and then in China later in the year. So, those
are three activities that we are planning now. However, we are also interested
in hearing your views on the important issues that federal data managers are
facing on an interdisciplinary data management and policy basis.
Unless there are any questions, I will turn the discussion back over to Goetz Oertel
and then to John Rumble, President of CODATA, to talk about the international
CODATA activities, which are distinct from the U.S. National Committee
activities, but which we, of course, help coordinate and support.
DR. OERTEL: I will continue with this very brief introduction. I just want to tell
you that the USNC/CODATA is here to communicate with the federal data managers.
We are here to hear from you and to listen to your concerns and issues that you
may have. We have asked the invited data managers to present their top two or
three concerns, and it will be extremely interesting to see how that works out
when we compare notes with the different agency speakers.
Of course, we want to see how CODATA may be able to help on the issues and
concerns that you are facing. The time is auspicious because the creation,
transmission, and use of data continue to increase at an exponential rate. Sea
changes are taking place in the way we do research, and in the legal and
regulatory environment and the business environment. More of that is to be
expected as this change continues. It is a very exciting time in this area. It
is also a time when we are probably in a position to be more effective than ever
before in CODATA because CODATA has just elected John Rumble, chief of the
Standard Reference Data Division at the National Institute of Standards and
Technology, as the new president of CODATA. John is committed to making CODATA
a force to be reckoned with and to be active and to influence things on an
international level in our common interest, in the interest that we share with
scientists in other countries.
So, I promise that we will pay close attention to what you say and we will give you
feedback. We are going to have your presentations recorded by a transcription
service, in order to enable staff to put together a complete and concise
summary of the presentations. We certainly don't want to lose any of the good
points that you will be making, which could happen without a formal transcript.
Let me also say that a general discussion will be very welcome. We, therefore,
would like you to stay within the allotted time of five to ten minutes. I will
give a first indicator to speakers after about six minutes and a second after
eight and the hook after ten in the hope that we will have some time left over
for discussion of some of the key issues.
I would now like to turn the discussion over to John Rumble, who will tell you
what international CODATA is up to.
INTERNATIONAL CODATA ACTIVITIES2
DR. RUMBLE: I want to say a few words about CODATA and then a few words about why
we had this session. CODATA, as Goetz and Paul both pointed out, is an
interdisciplinary committee, conducted under the auspices of the International
Council for Science. It is about 30 years old and its interest is in improving
the quality, reliability, management, and accessibility of data of importance
in all fields of science and technology. "Data" are not defined as
the full range of scientific and technical information, not full text, not
effective indexing, but quantitative data that are important in science and
engineering.
CODATA really is a group of people, who have a great deal of expertise in working with
data in all these different scientific areas. This is meant to be inclusive,
not exclusive; there are other areas, including the social sciences, that
actually handle quantitative and numeric data and CODATA is interested in them
all. We have participation from virtually every area of science of people who
are very interested in doing things with data, whether just collecting them for
two or three years or analyzing them, manipulating them, providing access to
them—everything that you do with numeric data to make them useful to scientists
and engineers throughout the world.
CODATA’s objectives are really recapitulated on the last two slides (see Appendix B). We
are focused on improving the use, improving the accessibility, and improving
the quality of data. How you actually do that is by fostering the international
cooperation among people throughout the world who are interested in scientific
and technical data and also by promoting the importance of the use and
collection and access to data. That is really how CODATA accomplishes our
goals.
CODATA provides an umbrella organization for international or multinational data
projects to work under. We work on standards, either formal or informal
standards for collecting, comparing, and exchanging data. The buzzwords, of
course, today are data integration and data exchange. We traditionally have
provided information on directories to data, on building directories. We also
have done some education and training and that takes place both in the form of
workshops and also special outreach courses. In particular, we hold a biennial
conference. Our last one was convened in November 1998 in New Delhi and the next
one will be in Italy in the fall of 2000. These conferences provide, I think, a
remarkably vibrant forum for bringing together people who are interested in the
management of scientific data to share ideas, to meet new people, to start
developing bilateral relationships, which lead to very useful and important
data projects, and, finally, to prepare key data sets, such as tables of
fundamental constants.
There are 23 member countries plus the Academy at Taiwan, who are the primary members
of CODATA and membership is on the basis of countries. Each member country or
the academy has a national committee similar to the U.S. National Committee,
which provides a focus for CODATA activities within that country and also it is
a source of talent to help administer the program. CODATA also has many
liaisons with the international scientific unions, such as the International
Union of Pure and Applied Chemistry, International Union of Biological
Sciences, and so on and so forth. And we have some supporting organizations and
we have task groups, which I will mention in a minute.
Now, why is CODATA becoming more and more important now? Why have we tried to gather
you here today to talk about your data needs and what CODATA can do to help
address them? That is because, of course, of this incredible information
revolution that is going on in science and technology. First is the Internet.
We don't have to say anything more about it, but it is just totally changing
the way we do the science. The Internet is facilitating international
collaboration. Finding data and information is much easier than ever before
through this medium, though we don't always know the quality of that information.
The second factor is, when I started in this business 25 years ago, the first thing
I had to do was write my own database management systems and it is a task that
I can assure you you don't really ever want to do. Today, anyone can get access
to computer tools that are powerful, robust, and operate in all sorts of
environments just simply by buying commercial software packages. So, the
ability for a scientist or an engineer or a group of scientists or engineers to
collect data and manipulate them using modern information technology is not
trivial, but it is so much easier than it ever has been. That means that
researchers, who are constantly looking for better ways of doing their work,
now do have better ways of doing that.
The third factor, and perhaps even more important to me, is the realization that,
with modern computing, networking, mathematics, and physical theory, we are
really able to do the modeling and simulation of extremely complex phenomena
and to get very usable scientific and engineering predictions. But,
furthermore, we also realize that for modeling and simulation really to work,
you need good data. Garbage in, garbage out is probably never more relevant
than today.
Obviously, this revolution is making it easier to do data work and it is easier for data
work to begin to get into R&D and that is a really significant point, but a
couple of things are happening. One is that as people in individual disciplines
learn the lessons about collecting, manipulating, using, and providing access
to data, they are learning very important lessons that really need to be shared
with the entire scientific community. Even though disciplines such as
bioinformatics develop very specialized tools, many of the lessons that they
learn in developing those tools and much of that knowledge is really directly
applicable to other disciplines.
As we develop tools, such as data mining, which really exploit databases, sharing
this knowledge is going to become more and more important, not only on a
discipline basis, but also on an international basis. This is why we think
CODATA can play a very important role in helping groups, such as those
represented here, who have formal responsibility for national data activities,
to cooperate with other scientific disciplines and combine to share the
knowledge that is developed to gain insight on knowledge that other people have
developed to serve your particular agenda.
I mentioned that CODATA has a number of task groups that have been going on for
some time in the biological sciences, in physics, and in materials science and
then some that deal more with getting people to work together. We have also
started other very important activities, which are aimed at data quality, data
evaluation, and the validation of tools that are used for data collection, data
evaluation, and data mining.
Some CODATA task groups attempt to define specific formats for reporting data and
sharing data on a very large discipline-wide area, while some other groups are
addressing specific problems. For example, there is a regional task group
focusing on Scandinavia and Northern Russia, making sure that environmental
data related to the disposal of radioactive material in ocean waters in that
area are of high quality so that people can figure out what kinds of problems
they have and also the severity of those problems.
I put this slide up here just to give you a flavor of the countries that are
really involved in not just being in CODATA, but running CODATA. You can see
they really span the major countries of the world that do research and
development, and that do data work. It is through these people and their
colleagues, who attend the meetings, that CODATA has the opportunity to make
liaisons at both the highest level and at a working level that would allow
groups within the federal government and the private sector here in the United
States to advance their national data agendas into the international arena
through these kinds of activities.
I would like to end by saying I feel a personal responsibility for getting many
of you people here today away from your everyday schedule to talk about these
events and, you know, I really appreciate Paul Uhlir and Goetz Oertel
responding to my request to get this community together—people who worry about
numeric scientific data, quantitative scientific data, to share with us what
your agendas are.
We too have an agenda today. First on that agenda is to make you aware that CODATA
is a vibrant, exciting organization, and that I, Lois Blaine, who is on the
CODATA executive committee, and Paul and Goetz, who represent the U.S. activity
here, are very much open to your suggestions for work that will help you in the
way that advances your data activities. We are all equally available any
time any of you would like to talk with me about possibilities; maybe we
could get a small group of people together internationally. We have, I think,
something to offer you.
Last, but not least, is that as you speak, even though there is a rapporteur here, I
actually have a very, very good memory. I hear interesting ideas. I will file
them away and you can be sure I will get in touch with you because I think this
is really probably the most exciting time in the data arena. I think that we
have wonderful opportunities to advance our data agenda, especially on the
international level, and I would be more than happy to work with you and the
rest of the U.S. National Committee, to achieve some of the goals. With that, I
will end and take questions, but I would rather keep questions to me rather
limited, so that we have time to go through everybody today.
DR. OERTEL: Thank you, John.
DR. GAINES: If I may raise one question, you stressed numeric data.
DR. RUMBLE: Quantitative, I think, is a better word.
DR. GAINES: Okay. I was going to say if that includes zeros and ones, then I think
we are all happy.
DR. RUMBLE: –and modern database management is how you differentiate between a zero
and null and not applicable. So, we have a lot of good discussions about that.
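
The distinction Dr. Rumble mentions, between a measured zero, a missing (null) value, and a value that is not applicable, can be sketched in code. The following is only an illustrative Python sketch, not something presented at the meeting; the names `Status`, `Reading`, and `mean_of_measured` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    MEASURED = "measured"              # a real observation, which may be 0.0
    MISSING = "missing"                # null: a value should exist but is unknown
    NOT_APPLICABLE = "not applicable"  # no value can exist for this case


@dataclass(frozen=True)
class Reading:
    status: Status
    value: Optional[float] = None  # only meaningful when status is MEASURED


def mean_of_measured(readings: List[Reading]) -> Optional[float]:
    """Average only the genuinely measured values; a measured zero counts."""
    values = [r.value for r in readings if r.status is Status.MEASURED]
    if not values:
        return None
    return sum(values) / len(values)


readings = [
    Reading(Status.MEASURED, 0.0),   # a true zero
    Reading(Status.MISSING),         # unknown, must not be treated as zero
    Reading(Status.NOT_APPLICABLE),  # excluded for a different reason
    Reading(Status.MEASURED, 4.0),
]
print(mean_of_measured(readings))
```

Treating the missing and not-applicable entries as zeros would pull the average down to 1.0, which is the kind of error the distinction is meant to prevent.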
DR. VAITUKAITIS: When one sets standards, how does one then investigate the plot of
those standards?
DR. RUMBLE: Briefly, these standards are an economic activity and there is
virtually no way you can make people do that. In industry, the way they do that
is groups like ASTM develop a standard and companies say when we work together
we will reference this standard and do things that way. Scientists don't
have that kind of economic activity. However, agencies have projects where
people are collecting data and going to submit them to a common repository. The
contracts and grants for these projects can call out those standards.
Second, if you do your standards work well and it is clear, understandable, robust, and
meaningful, the community often will band together and use it voluntarily.
There are examples of this in crystallography, surface science, and some of the
bioinformatics. Medical researchers are some of the poorest advocates of
standards of that kind.
DR. OERTEL: Thank you very much.
I would just add to that a couple of words on the activities of the U.S. National
Committee. When I first became involved with the committee, the emphasis was on
studies, including Finding the Forest in the Trees (NAP, 1995) and
Bits of Power (NAP, 1997). Little did we know at the time how important
these studies would be. When major policy developments came about in the
intellectual property rights arena, and more recently in the legislated Freedom
of Information Act extension to allow for access to research data in an
unprecedented way, we had the staff here prepared with consensus positions,
reports that had gone through the NRC system and that had the NRC's blessing.
These reports turned out to be very powerful in presenting the position of the
Academy and of others in this country against what might have been and still
could be major intrusions in our ability to do our jobs in government and
elsewhere in the United States.
So, I would say that without us necessarily knowing where we were headed, we were
well prepared for it. One of the real challenges that I see is to know where
the challenges of the future are going to be, what is going to bite us next. We
certainly didn't anticipate the recent FOIA data access provision. The
legislation came out of the blue, and the committee discussed this topic
yesterday and we will continue to think about that issue. This is another area
in which you can make important contributions today and tell us what, perhaps,
you see coming around the bend next. So, without further ado, I would like to
call first on Peter Weiss from OMB.
DISCUSSION WITH AGENCY DATA MANAGERS
Peter Weiss, Office of Management and Budget
MR. WEISS: I am going to be brief, if that is humanly possible for me. First, a
disclaimer. I am going to be blunt and brutal here this morning and the views
expressed are my own and do not necessarily reflect the Administration or OMB.
There is a single major issue and a single major challenge that needs to be
confronted. A lot of people have been concerned about the imperfections in
carrying out United States domestic data policy in the real world, or about the
scientific issues around database protection and issues around data under
grants, but all of these are merely subcategories of something much bigger. The
biggest mistake you all can make is to presume that the policy of open and
unrestricted access to scientific and technical data as enshrined in both
statutes and policy in the United States is number one, secure and, number two,
not potentially severely undermined by things that are going on around the
world.
Complacency on this point is your enemy. You should have, I hope, my article that appeared
in Borders in Cyberspace,3 talking about the growing trend, particularly among
our European friends, of what I call "government commercialization."
This is the trend to attempt to create government data monopolies for the
stated purpose of revenue generation, but with the clear secondary purpose of
bureaucratic self-aggrandizement. We see it most specifically in the European
Union (EU) weather services, and I use the World Meteorological Organization
(WMO) Resolution 40 as a case study of this phenomenon. However, we are seeing it
more and more in other areas. The Ordnance Survey in Great Britain is an
excellent early example of the concept of governmental authorities telling
scientific and technical data agencies to get their butts off budget and
attempt to aggressively use government copyright to recoup costs. The
bureaucrats love this, because it allows them to make believe they are
entrepreneurs. The concept of the "entrepreneurial bureaucrat," I
suggest, is an oxymoron.
This trend is being further encouraged by the European Union’s Database Directive.
Again, I will try to show you how these are connected. The European Database
Directive specifically contemplates government agencies protecting otherwise
non-copyrightable data. Whatever you think of the merits or lack thereof of
database protection legislation in the United States with regard to privately
developed information, it is critically important that any database protection
legislation in the United States, number one, not apply to data generated with
taxpayers' dollars.
Number two, we need affirmatively to roll back these emerging policies and practices in
the European Union, and we need to convince anyone we can both in the European
Union and around the world that this is the major threat. You have in your
handouts the DG-XIII so-called "Green Paper" on "Public Sector
Information: A Key Resource for Europe".4 This is a wonderful
document that says all kinds of great things. I accompanied Jack Kelly, head of
the National Weather Service, to Geneva in February to meet with our European
Union and WMO counterparts. Not only do they defend the specific policies that
require them to be entrepreneurial bureaucrats and stick to their
right to be entrepreneurial bureaucrats, but when the Green Paper was pointed
out to them, it was absolutely fascinating because consistently they all said
DG-XIII does not know what it is doing. They said that DG-XIII is not really a
player in this area, and that they (not DG-XIII) represent the policy of their
governments. This is just a green paper, they say. They basically said that it
is not going to go anywhere and that they have every intention of ignoring it.
Were I not so naive, I wouldn't have been surprised. But being naive, I was
surprised. I have been unable to determine from a purely political perspective
what the future of this document is. And that is key, absolutely key.
The last point that I want to make is that I am not a scientist. I am not a data
person. I am just a country lawyer, but I have been hanging around Washington
now for over 20 years and I am very experienced in congressional and other
political issues. I have become over the last couple of years shocked and
chagrined by the fact, as I perceive it, that both domestically and internationally
the scientific community has been remarkably ineffective politically.
Domestically, I don't know why, but the scientific, research and educational
communities have not been able to gain traction legislatively on the database
protection issue. The scientific community, such as the National Academies and
other groups, has written wonderful letters, and stuff like that. But the
political traction just doesn't seem to be there. It is very regrettable. I
don't know why that is.
This whole issue of data access under grants, the so-called "FOIA issue,"
I have to tell you if you really thought that it came out of the blue, you all
have been asleep. This issue has been kicking around in Congress for over a
decade. It was inevitable that at some time an agency would engage in a data
policy practice, which became perceived, rightly or wrongly, as obnoxious. As
soon as that occurred, this type of a reaction was inevitable. As a matter of
fact, I am told by a friend of mine, who is a FOIA expert, that this issue was
explicitly raised almost 20 years ago before the American Bar Association.
However, because at that point in time there wasn't a squeaky wheel, nothing
came of it. There are equities on both sides that need to be balanced.
Regrettably, what happened in the 1999 Omnibus Appropriations Bill was that
some extremely poor legislative drafting occurred.
Again, I note that grantee folks tend to look to their grant agencies to be their
protectors. A good buddy of mine is a grant manager at NIH and they have been
inundated by people complaining that they "do something" about this
so-called FOIA issue. I can tell you that the people on the Hill are not
interested in anything that any of the grant agencies have to say.
So, the bottom line is two-fold. Number one, we have a major challenge. It is not a
given that open and unrestricted data is the policy of the future. It will only
be the policy of the future if it is successfully fought for. Secondly, both
domestically and internationally, the scientific community has to get political
traction. I don't know how you do that, but you don't have it now. And I don't
know what role CODATA or ICSU can or should play in that. I leave it to you
all.
DR. OERTEL: Thank you.
DR. RUMBLE: One quick comment is that ICSU and CODATA have a Data Access
Commission, and the chair of that and I were talking about the fact that the EU
is going to be reviewing the database protection law. When I am over there next
week, we are going to arrange a time to have a meeting in Europe to get the
European community at least to raise their voice somewhat in advance this time
because last time they were caught totally off guard.
MR. WEISS: My free advice would be this. When we were meeting with the European
weather people and they were disparaging the "Green Paper," I said,
"if we were to approach your respective commerce, science, and environment
ministers and ask them if they shared your view, what would they say?"
Their answer was of course they would support us meteorological folks, but they
got nervous when I asked them that. That should be a hint as to where you might
want to go politically.
DR. OERTEL: You mentioned that the process was flawed, particularly how the FOIA
issue got through the system the last time. Is there any way that you all might
be able to police that in the future?
MR. WEISS: When in the dark of night, dozens of, for lack of better words,
"silly" provisions are attached to major appropriations legislation
that make a difference between keeping the government running or not, no
president is going to veto those bills. Now, the Congress of the United States,
in both the House and the Senate, has a rule that, translated into English,
says something like: "thou shalt not attach positive law legislation into
appropriations measures." They do not follow that rule. Talk to them about
it. And I have got to tell you again on this political issue, they don't want
to hear from your favorite grant agency. They don't want to hear from
bureaucrats either. So, the scientific community is well advised not to rely on
intermediation. You must make your case politically directly because we
intermediators, for whatever reason at this point in time, are not successful.
DR. OERTEL: Okay. Well, that is very clear and I appreciate that. I am not quite
sure how we are going to do it, but we will have to. Thank you very much. That
was extremely useful. I would like to next go to Alan Gaines from the National
Science Foundation.
Alan Gaines, National Science Foundation
DR. GAINES: I have prepared some viewgraphs to guide this discussion (see Appendix
C). One aspect of this that I should put in perhaps as a caveat is reflected by
my new title, which is Senior Associate for Spatial Data and Information. I am
focusing specifically on spatial information. That generally means a
geo-spatial reference. This partially reflects my earlier question to John
Rumble, because a lot of our data are not numeric in the traditional sense,
but, in fact, deal with imagery and so on. I know a number of you around the table
from other groups also share this interest. So, perhaps, one plea I can make is
that the realm of spatial data and information, I think, is also a special
class of information more generally and it does have some of its own unique issues,
as well as sharing all the issues of data, more generally speaking. However, I
am not going to dwell on that in terms of what I want to talk about today. I
will, in fact, speak very generally.
There are four data management issues that I would like to deal with briefly in turn.
The first issue is multiple use, which, I think, will be common to most of us;
the fact that data may have value beyond the purpose for which they were
collected. This is one of the really exciting advances that has come up,
particularly with increased and advanced communication capabilities that we now
have, such as the Web and so on. In fact, it has enabled the creation of new
knowledge from those data when they may be used in a different context from
which they were generated. This is really the heart, I think, of the
interdisciplinary science and research and I think is one of the most exciting
aspects of what is happening in the information revolution. I would state that
in order for this kind of exchange or multiple use of data to be successful,
the generation of metadata is absolutely essential. And I know that several of
you will agree with this because it is sort of a mantra. But the metadata are
required not just for discovery of the data by someone who isn't part of the
information community, but, in fact, also for some initial valuation as to
their usability or value in this new domain.
So, dual purpose, but metadata are a real key and I would say that there is even a
requirement for the metadata to be extensible because when someone else uses
those data, they may be able to provide an evaluation of the utility of those
data in a different context. That evaluation should, in fact, be fed back into
the metadata so someone else doesn't have to repeat the process. They can have
some verification or validation of that particular use. This is going to be a
kind of a recurring theme, I guess, because I think it is extraordinarily
important.
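
The idea of extensible metadata, where a later user's evaluation is fed back into the record, can be illustrated with a short sketch. This is a minimal Python example under stated assumptions, not a description of any NSF system; the class names `Evaluation` and `MetadataRecord`, their fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evaluation:
    """One user's assessment of the data in a context other than the original."""
    evaluator: str
    context: str       # the new domain in which the data were tried
    fit_for_use: bool
    note: str = ""


@dataclass
class MetadataRecord:
    """Discovery metadata plus an open-ended list of reuse evaluations."""
    title: str
    collected_for: str  # the purpose for which the data were generated
    evaluations: List[Evaluation] = field(default_factory=list)

    def add_evaluation(self, evaluation: Evaluation) -> None:
        # Feed a later user's experience back into the record.
        self.evaluations.append(evaluation)

    def validated_contexts(self) -> List[str]:
        # Contexts in which someone has already found the data usable.
        return [e.context for e in self.evaluations if e.fit_for_use]


record = MetadataRecord(title="Coastal imagery (example)",
                        collected_for="shoreline mapping")
record.add_evaluation(Evaluation("user A", "wetland classification", True))
record.add_evaluation(Evaluation("user B", "bathymetry", False,
                                 "resolution too coarse"))
print(record.validated_contexts())
```

A second user interested in wetland classification could consult `validated_contexts()` rather than repeat the evaluation from scratch.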
Another issue is interoperability at the data level, where in order for the data to be
useful, not just available, to communities other than those for which they were
generated, they have to be interoperable and probably the first Holy Grail
there is semantic interoperability. I paraphrase that by the question, What do
you mean by semantics?, because even that term means different things to
different people or different things to the same person in different contexts.
We have semantics at the machine language level. We have semantics at the data
level. We have semantics at the application level and at the human level or the
disciplinary implementation. And it is not just a matter of new terminology.
The bigger problem, in fact, is a word that is used broadly, but with different
meanings. So, when somebody hears it, they understand it, but not necessarily
in the way it was used. So, this is a real problem that needs work.
The
key, of course, is standards for a model that enjoys a wide acceptance in use.
I would say that these things are necessary for the interdisciplinary sharing,
but my second point is that they should really be minimal and the concept of
chaordic as introduced by Dee Hock simply means enough structure or order to
make the system work, but still allowing a certain degree of chaos or
individuality in doing it. Possible solutions here will be things like
development of XML as a language, UML for modeling language. These may be
getting us there.
The
next management issue is data quality. Again, this is a relative term, not
absolute. Quality of the data in one use may be very different from that in
another use. Data that are high quality for the use for which they were
generated may not be as good for some other use. Again, there are multiple use
concepts, which introduces new parameters.
The
factors to deal with are uncertainty in the measurement, but also the total
consistency, the uncertainty in scale— in fact, when combining data from
different sources, they may be at different scales, and also, of course, the
source. We desperately need to devise some way of representing the uncertainty
in the data. Metadata carry that physically, but a lot of users aren't
necessarily going to go to the metadata or they may not be able intuitively to
understand them. So, another one of the Holy Grails is to devise some visual or
intuitive way of representing the uncertainty in data.
The
fourth issue is archiving. This is something that we, obviously, all deal with.
An archive is a single source. The traditional archiving function is to make
sure that the data are refreshed so the internal consistency is good. But they
should also be extensible. I mentioned that before, both in data and the
metadata, but also in format and medium. If they are to be persistent, we need
to be able to accommodate new media as they come along. Similarly, storage
gives you the format and medium and I think we need to start dealing with
issues of disposal. As our data get clogged, we are going to have to start
feeding some of them out of the bag.
I
am running short on time, so let me go quickly through three policy issues. I
would like to introduce what I call protective sharing and this, perhaps,
addresses the FOIA issue. I think that it is necessary to give some limited
exclusive use of data to the people who are generating them. In science, this
is one of the things that we run up against as a principal barrier to full and
open sharing, and that is that people really want to be first to publish. And I
think that most of us would agree that that is a reasonable expectation. So,
some period of exclusive use should be allowed, then the data should be shared.
There are other questions of costs of archiving and management of the data. In
fact, I would make a real distinction. We often talk about free and open access
to data. I would say that is distinct from full and open and that sharing can
be done for a fee, as well as for free.
On the question
of restricted access, there are some data that need to be shared but not fully
for a variety of reasons. Following on that, a policy issue that is often not
thought about is that of liability, particularly when the data are being used
for decision support and decisions are made based on those data that may involve
money, lives, property, et cetera. If the decisions are wrong, there will be a
tendency for the user to blame the data source. So, I think liability in sharing
data is an issue that needs to be addressed.
DR.
OERTEL: Thank you very much, Alan. In the interest of time, I would like to
move on. I hope, Alan, you can stay around in case there are questions in the
general discussion later.
Our
next speaker is Gerry Barton of NOAA.
Gerald
Barton, National Oceanic and Atmospheric Administration
DR.
BARTON: Good morning. I am Gerry Barton from NOAA, which is part of the
Department of Commerce. I have two major issues and then some other things that
I will cover this morning. A major problem is the growth of the archive and how
we pay for it and how we manage it. For example, in terms of archive growth,
early in 1990 NOAA started off at about 130 terabytes and has grown in 1999 to
about 750 terabytes (see Appendix D). This breaks down in a number of different
ways across data sets. It is estimated that over the 15 years from 1999 to 2014
the total archive will grow to 20 petabytes. These are in
petabytes now, not terabytes.
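The archive figures quoted above imply a compound annual growth rate, which can be worked out directly. The sketch below is a rough illustration only: the sizes and years come from the remarks, but steady year-over-year growth and decimal units (1 petabyte = 1,000 terabytes) are assumptions, and the function name `annual_growth_rate` is made up for the example.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two archive sizes."""
    return (end / start) ** (1.0 / years) - 1.0


TB_PER_PB = 1000  # assumption: decimal units

# About 130 TB in 1990 growing to about 750 TB in 1999.
past_rate = annual_growth_rate(130, 750, 1999 - 1990)

# About 750 TB in 1999 growing to a projected 20 PB in 2014.
projected_rate = annual_growth_rate(750, 20 * TB_PER_PB, 2014 - 1999)

print(f"1990-1999: {past_rate:.1%} per year")
print(f"1999-2014: {projected_rate:.1%} per year")
```

Under these assumptions both periods come out at roughly a fifth to a quarter more data each year, which is the scale of growth that has to be set against a level budget.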
This
growth rate through 2014 occurs for two reasons. The first one is major remote
sensing systems. GOES is the satellite that we have had forever and it grows a
little bit in the later years. The Doppler radar system that you are all
familiar with is also included.
NOAA
is also responsible for archiving data from the Defense Meteorological
Satellite in our Geophysical Data Center in Boulder, CO. The NASA Earth
Observation System also is expected to add tremendously to the data rate. In
addition, smaller remote sensing systems and the polar orbiting satellites are
also considered.
So,
we have this major growth in radar. The problem is how do we archive the data?
How do we take care of them and where does the money come from? NOAA’s budget
for the data systems has been level for a number of years, level in today's
dollars, so that it is actually going down in the constant dollar sense. We are
not funded in the data centers. The base funding does not cover the budgets of
the data centers. So, there is a lot of soft money that they keep looking for.
Some of these soft money sources are drying up. It is a very difficult problem
and then we have how are we going to pay for all of the data that are expected
to come as I just discussed. So, budget issues are just paramount.
The
second major issue, as Peter mentioned, is property rights and the
commercialization of data. Peter already brought up the weather system problem.
The data transmitted over the Global Telecommunications System used to be about
8,000 stations when we were playing with it very actively and that was about
ten years ago. I think it is down now under 7,000 stations and it may be more.
That is a major problem because these data are used for the weather models in
the forecast that you see every day. The more the countries hold back their
data, the less accurate the forecasts will be. We should be going in the other
direction. So, it is a major problem.
NOAA
also has problems with restrictions on data distributions, which Alan also
mentioned. These restrictions of distribution and related caveats are forcing
the data centers to look at individual data sets and impeding the release of
complete data sets. We have to clean that out and release that just on a
restrictive basis. It is creating major problems.
One thing that NOAA is doing—this isn't really a problem because we have some
money for it, but if we lose the money it could be a problem—is data rescue. We
have thousands and thousands of data sets of various kinds on magnetic tape, on
microfilm, and on microfiche. I think we are pretty much done with the cards,
but there are still some data in paper formats and we are getting those
converted over to digital formats. We have been attacking that problem with
several million dollars for the past several years and it is really paying off.
I think approximately 88,000 reels of microfiche and microfilm have been
converted over. So, data rescue is a major problem, but we are handling it and
hope to be able to handle it better in the future.
Regarding
data management for other things such as the spatial data, we work very
strongly in NOAA, in the Department of Commerce, and in the Federal Geographic
Data Committee (FGDC), but we have no money to really do very much. We have to
take it out of our own budget. So, we are trying to get through the FGDC to get
a coordinated budget for federal geographic data activities and we have a plea
for 2001 to try to get some money for that.
Let's
talk briefly about Landsat. NOAA will no longer be involved with Landsat. The
spacecraft will be launched April 15th, and USGS will distribute the
data. The data will cost about $600 for the first scene. After that, it is the
cost of the media. This really attacks what was done in the USNC/CODATA’s
report on Finding the Forest in the Trees. But it attacks a specific
issue that was in there. Now there has been the commercialization of the
Landsat data and the lack of research that resulted from it for a long time.
NOAA
conducts a number of international activities. We now have a node of the Global
Environmental Locator Observing System, which is an international project. We
also work with the Committee on Earth Observation Satellites (CEOS) and other
ongoing international activities. We also worked with GOIN, which was an activity
between Japan and the United States. That project just ended, and it will be
moving into CEOS. With regard to CEOS, Helen Wood is chairing the Integrated
Global Observing Strategy project that has a disaster Web page that shows how
satellite data can be used for various disasters.
DR.
OERTEL: Thank you. The next speaker is Hedy Rossmeissl, who is from the USGS,
at the Department of the Interior.
Hedy
Rossmeissl, U.S. Geological Survey
MS.
ROSSMEISSL: I am going to talk about three issues that at the U.S. Geological
Survey we think are very critical data management concerns for us and some of
them have really already been articulated.
First,
data archiving and preservation; second, data integration; and, third, access
to data. For data archiving and preservation, I can parallel a lot of the
things that Gerry just said about scientific data and the amount of data. At
the USGS, we have scientific responsibilities for data in geology, hydrology,
biology, and geography. Many of these data sets are national in scope and we
see very large responsibilities growing in this area. As Gerry just mentioned,
for Landsat 7, the U.S. Geological Survey is taking a much wider role in
Landsat now than we had previously for actually managing the ground stations
and also for the preservation of all of those data. So, we also had a lot of
old data previously, but I think we are moving up into the petabyte level of
data collection in this satellite arena as well. So, we were very fortunate
last year and again for 2000 to have received some additional funds from OMB
for some additional archiving. We have been working on that issue with OMB and
we are hopeful that we can continue to add some additional funds to our budget
for these responsibilities because, again, we see them as being absolutely
critical for the scientific community to have these data sets. Long-term
preservation is imperative and we are seeing for the older Landsat data, a
number of sales of those data sets. So, they are fairly popular in the
community for looking at long-term studies.
USGS
is also working with some of the commercial satellite companies from the
perspective of long-term preservation. As the data get older, they are less
economically viable for those satellite companies, plus it is a big burden also
for them to be holding a lot of data for a long time. So, we are talking with
companies, like SPOT Image and some of the newer commercial companies that are
putting up satellites, to try to make some agreement about, again, the
longer-term preservation of those data. That will continue to be an issue for
us as those data holdings continue to grow.
Another
big issue related to the archiving actually concerns access, because we are
finding that in the past it was enough to say that the data were archived. Now,
with the Internet and other advances, people expect you to make it available to
them very rapidly. That puts a different flavor on your archive. It is not just
a matter of having the data and having them preserved, but now there is a whole
different set of requirements there to have your archive more easily
accessible. Again, with the volumes of data we are talking about, that is an
extreme challenge for USGS to meet.
Another
aspect of the archiving relates to real-time data on the Internet. We are presenting
many data sets now, many of which are online and you get different readings
every 20 minutes and that sort of thing. The preservation of those kinds of
data, when you take certain time spans, and how you actually archive those types
of data is, again, a challenge that we haven't had before because we hadn't
been presenting data that quickly. So, those are actually some of the biggest
issues that are puzzling us now.
Data
integration is certainly another very important consideration. Again, from the
USGS perspective, we have such a wealth of earth science information and now
the biological resources component has been added to the Survey. So, we are
working much harder within our organization to look at that integrated science
approach in not just our data activities, but in our science and how we are
approaching different problems. We have ecosystem studies and other studies
that we are working with other federal, state, and local agencies. Regarding
the integration of the data—some of these issues have already been talked about
related to the content of the data formats, the accuracy, and how you look at
putting those data together across a wide spanning set of partners. It is a big
challenge.
As
Gerry Barton mentioned, the documentation of legacy data sets is a major issue.
USGS has our data in the shoeboxes and we are trying to recover and make them
more widely available and also then to put those in a data integration
perspective.
With
regard to copyright and data restriction issues, one issue is the use of data
from state and particularly local constituencies, which are looking at gaining
resources and money from their data. Working with those agencies to get
information that we can put in our national databases is a challenge.
Also,
we have been trying to work more with the private sector in areas like
transportation data, for example, where you have many, many companies now that
are working in that area and they have other issues. We don't have the
resources to have that kind of detailed data. So, we are working with them and
trying to convince them that maybe lowering the accuracy of some of those data
and getting them into the public domain is useful.
I
already mentioned a couple of things about data access, particularly the
expectations of the customer community now that they want data accessible on a
much more timely basis than we have had to deal with before. As a government
agency it is hard to retool all of your systems on an adequate basis to meet
that need. We have been working on offering those data through the Internet,
creating a multiple set of tabs for people to actually find the data more easily.
Another
issue that has been brought up before is the fees for access versus access
without a charge. Our scientists really want to see the data accessible without
charge. Some of us who are working in the data area, however, feel that if we
don't try to work some new algorithms for some fees for those data, we are not
going to be able to keep the infrastructure in place to be able to offer the
data. So, we are dealing with that issue—it is not just buying a product
anymore, it is how do you get some of those data access mechanisms set up.
We
are also finding as we are talking to the private sector that they are willing
to offer some of these data without charge so that they can then add value to
them and offer more enhanced products. Sometimes the federal agency is a little
bit leery about some of those arrangements, but we are getting a lot more
interest from the private sector in offering data for them to consider value
adding.
I
guess that sums it up.
DR.
OERTEL: Thank you very much. I think we are getting great information from all
the speakers.
MR.
WEISS: Another species of cooperation with the private sector, besides the one
you mentioned, is your TerraServer initiative. You might want to talk about it
for a moment, explain what that is.
DR.
ROSSMEISSL: USGS has a relationship with Microsoft. They approached the agency
and were interested in offering our digital ortho photo data. Their interest
was the fact that they wanted to build a very large database and show that the
capabilities of their SQL Server software could handle very large databases
on the Internet. We had a very large database. So, we came together. It has
been an excellent relationship from the perspective that USGS has been working
with the research side of Microsoft, not the commercial side. There hasn't been
a big issue about them wanting to make a lot of money off this site. And for
us, we have had a tremendous amount of visibility for those data, a lot more
interest in them, much more widely known and used now.
So,
again, we are finding that other companies are coming to us now and are willing
to do some presenting of data, again, for more of the public good. I think that
is a good turn of events that we are seeing. Microsoft is now interested in
moving that TerraServer to their Encarta online activities. So, we will see how
this evolves down the road.
DR.
OERTEL: Breaking new ground. That is very interesting.
DR.
RUMBLE: When federal data managers worry about some of these issues like
archiving and interchange and things like that internationally, how do you do
that? Do you just have a limited number of partners, like the European Space
Agency and Japan Space Agency, that you deal with, or what mechanism do you
use?
DR.
BARTON: There are many, many different ones. It just depends on the situation.
For example, one that I mentioned earlier was an agreement between our
president and the prime minister of Japan. So, it was at that level.
DR.
OERTEL: I would like to move on to Bob Shepanek from EPA.
Robert
Shepanek, Environmental Protection Agency5
Overall EPA Direction
"EPA’s data resources represent one of the Agency’s greatest assets. As a national
Federal source of reliable and comprehensive statistical information on the
state of public health and the environment, EPA is uniquely equipped to
provide the public with critical tools to pursue responsible policies."
Browner and Hensen, EPA Reorganization Memorandum, February 1997

Strategic Drivers
Practice of Environmental Science is Changing
Technology is Evolving
Challenge to ORD: Leverage technology to meet the needs posed by changes in
the practice of environmental science!

Scientific Information Management Challenges
Technical Challenges
Management Challenges
Cultural Challenges

Cultural Challenges
Influence change through scientific societies and peer pressure – CODATA efforts on H.R.
354 and A-110

Management Challenges
Commitment of adequate resources for systems development, operation and population
Support for related policies and procedures
Appropriate incentives for involvement by staff and project participants
See USNC/CODATA report, Finding the Forest in the Trees (National Academy
Press, 1995)

Technical Challenges
Improve access and documentation of data and information relevant to environmental
scientists in other countries

EPA ORD Activities
Re-inventing ORD as an Organization Engaged in Science Information Management
Scientific Information Management Policies, Procedures and Standards
Outreach
Development of an Advanced Architecture

Additional Information
DR.
OERTEL: Thank you, Bob. Talking about your outreach point, I am glad you are
here to liven things up and it is very valuable. The next speaker is Wanda
Ferrell from DOE.
Wanda
Ferrell, Department of Energy
DR. FERRELL: I am in the DOE Office of Biological and
Environmental Research Branch, where we do climate change and genome research
(both human and microbial). The human genome work is done in cooperation with
NIH.
Our
largest global change program is the Atmospheric Radiation Measurement (ARM)
Program. The objective of ARM is to improve the parameterization of the role of
clouds in GCMs. The observing capabilities for ARM are located in three
geographic regions: the U.S. Southern Great Plains, the Tropical Western
Pacific, and the North Slope of Alaska. The Southern Great Plains site is the
oldest and largest with over 200 instruments; we have named it a climate
observatory. The North Slope of Alaska site is located in Barrow with a second
one coming on line later in the year. The Tropical Western Pacific site
comprises stations at Manus and Nauru with a third to be added next year.
In
the ARM program, data transmission is a problem, since we are using a
distributed data system. With the exception of the tropics, we have developed
site data systems for processing. The site data are then sent to the experiment
center, located at the Pacific Northwest National Laboratory, for some
additional processing before being sent to the ARM Archive. The Archive
provides access to the general science community.
DOE
does reserve the right to charge if someone were to ask to empty the archives,
but beyond that, data are available at no cost. The issues I will address here
are intellectual property, resources, and metadata. Concerning intellectual
property, I only exhort you to continue your good work in tracking this issue.
This is an area that presents a lot of potential problems for the science
community, and I think CODATA has been in the forefront in raising the flag for
the agencies and the general community as well as addressing the problem. You
have kept us apprised, been way out ahead of the rest of us. So, any help that
we can give you, we will be happy to lend our support.
Resources
for data are always an issue. Specifically, in the genome program in the next
18 months there is expected to be a huge explosion of information. I am sure
NIH will speak to this as well. The big push to complete sequencing is going to
create a huge bulge of data flowing into the system. We have to plan for not
only storing these data, but also providing useful search tools for finding the
needle in the haystack.
A
general issue is how to package information so that it is useful to the primary
scientific community, as well as to secondary users. We cannot always
anticipate the audience for our data products. Thus, we have to respond to
needs identified by the broader user community.
User
expectations drive the resource requirements. When technology changes and as
technology changes in the general community, people expect the data systems to
respond accordingly. Similarly, success brings new requirements. For example,
in ARM a few years ago we created special data products designed to address high
priority scientific questions. As the usefulness of these early products was
demonstrated, it created a demand for more. Staff are dedicated to creating new
special products and to modifying existing products.
It
is a common misperception that you build a data system and go away and leave
it, that a data system requires no enhancements. But users expect a system to be
responsive and to be updated to include new technology developments. The Web
has been the basis of growing user expectations over the past few years. We now
have to respond to these expectations and to respond in a flat budget
environment.
Metadata
is a very important issue for ARM, particularly for the area of data quality,
which is our highest priority. In ARM, how to tag the data files is a major
issue, since we currently produce over 2 gigabytes a day. As the sites grow and
as we produce special data products, this number will grow dramatically. We are
not like the satellite systems that have huge files. DOE has lots of small
files. So, this produces a problem for us that is not usually addressed by data
management committees.
Another
issue is constantly developing new Web tools so that we can search the
databases that we have.
My
last issue is a personal one and not a DOE position, that is, we should have
some means of publishing data sets or at least having some process by which
data sets can be cited. Yesterday, the committee discussed how to track the use
of data sets. Citations for data sets answer this problem. Publication of
techniques also provides documentation for processing. Publication by people in
data management helps with career advancement and recognition. We need
incentives for technical people to work in data management. Our data centers
employ a mix of computer scientists and physical scientists to produce needed
data sets. But for physical scientists it is important to their careers to publish;
thus, it's difficult to recruit without this incentive.
DR.
OERTEL: Thank you, Wanda. If there are no questions for Wanda, then we’ll move on
to Richard DuBois from NIH.
Richard
DuBois, National Institutes of Health
DR.
DUBOIS: One nice thing about being one of the later speakers is that a lot of
these issues have already been discussed.
I
am with the National Center for Research Resources, which is a freestanding
center within the NIH. We specifically are not data handlers, but we do support
a great deal of research activity that deals with the handling, the generation,
the storing, and the analysis of data. So, we feel we have something to say
here.
Basically,
I want to start with this slide that says "In recent years it has become
evident that only through the integration and analysis of heterogeneous data
will it be possible to truly understand and control disease." It wasn't
until rather recent times that this was obvious, but it is obvious. Because of
what I am going to say later, it will become obvious that this has been a major
problem for us because of a number of databases. For example, there are the
crystallographic databases at Brookhaven and Cambridge. The Brookhaven
Crystallographic Database has over 10 gigabytes. However, where we really get into
high level data is in the imaging databases. For example, we have a site at
UCLA that currently has over 2.2 terabytes of storage. That is a lot of
storage.
But
one of the areas that we are involved with, the one with DOE, is in the GenBank
database, which is over 13 gigabytes of which the human genome component of
that is 6.5 gigabytes and the other, of course, with these various other areas
like yeast, Drosophila, which is fly-based, C. elegans is the worm based, and
then mouse. Actually, the mouse gives new meaning to are you a man or a mouse.
I
want to show this slide which displays volume sizes by resolution for the
brain. This is from the resource I mentioned before at UCLA. If you look at a
typical brain, it is around 50 or a hundred cubic centimeters, if you have a
voxel size of 1 centimeter, which would be the resolution, you end up with a
4.5 kilobyte database. But as you move to a millimeter, it jumps up by a
factor of 10, which you would expect. Well, when you get to the 10 micron area,
it is 4.5 terabytes and then if you get to a 1 micron area, it is in the
petabyte range. The research group has digital cameras measuring in the 3 to 5
micron region and assuming at some point we are going to want to be looking at
cells and that is in the 2 to 4 micron area. So, as you can see, the amount of
data that are going to be collected in these areas is immense.
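The figures quoted from the slide are consistent with storage growing as the cube of the linear resolution: each tenfold refinement of the voxel edge multiplies the number of voxels, and so the storage, by a thousand. The sketch below reproduces that scaling from the 4.5 kilobyte baseline at 1 centimeter. It is only an illustration under stated assumptions: decimal units, a fixed number of bytes per voxel, and a hypothetical function name `volume_bytes`; it is not the UCLA resource's own calculation.

```python
BASELINE_VOXEL_CM = 1.0   # 1-centimeter voxels
BASELINE_BYTES = 4.5e3    # 4.5 kilobytes quoted for that resolution


def volume_bytes(voxel_size_cm: float) -> float:
    """Estimated storage for the same brain volume at a finer voxel size."""
    # Refining the voxel edge by a factor k multiplies the voxel count by k**3.
    linear_factor = BASELINE_VOXEL_CM / voxel_size_cm
    return BASELINE_BYTES * linear_factor ** 3


for label, size_cm in [
    ("1 cm", 1.0),
    ("1 mm", 0.1),
    ("10 micron", 1e-3),
    ("1 micron", 1e-4),
]:
    print(f"{label:>10}: {volume_bytes(size_cm):.3g} bytes")
```

With these assumptions the 10 micron case comes out at about 4.5 terabytes and the 1 micron case at about 4.5 petabytes, matching the ranges mentioned in the remarks.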
So,
where are we going? Well, just looking at the genome, the next step once you
get the genome is to start looking at the proteome, that is molecular function.
What do these proteins do? The ideal access infrastructure has to be put
together as sort of a middleware development. Not only do you have to develop
this middleware, but it has to be transparent to the user because most
biologists don't want to be computer scientists.
Finally,
there is the development of data mining, which we have heard about before,
integration and modeling situations. Data mining is a fairly newly coined
term. As I understand it and we deal with it, data mining has to do with
looking at data with the goal of developing new relationships, but usually
without any particular target at the outset. So, you can imagine how difficult
that is going to be. I mean, the kind of software that is going to have to be
developed to be able to do the data mining is not going to be easy to
accomplish.
Going
a little bit further, although people are constantly thinking about this
already, is getting to the cellular and organ function and physiome. Now, this
will require extremely complex models and simulations. And it is only going to
be created from using heterogeneous data sources. There is just no way they are
going to be able even to begin doing this. Certainly, more sophisticated
data handling and visualization tools will be needed, including the use of
expert system applications; that is, so far, artificial intelligence. You are
going to have to be able to get help from well-crafted software that provides
this expert capability.
So,
from our point of view, what are the management issues? Well, these are not all
of them, but they are certainly the important ones. We are very interested in
standards. It is very important that databases in the future, even today, be
able to talk to one another. That is happening, but it is happening slowly and
not very efficiently in many instances.
Training
is a major issue. It turns out that the people who do all this work that I have
been talking about are scarce and those that are good and even some that are
not so good get hired up by the pharmaceutical industry very, very quickly. The
other problem is that in the university environment, those kinds of people are
sort of between a rock and a hard place. They are not really computer
scientists and they are not really biologists. So, they don't really do very
well in that environment. Those issues have to be addressed. Quickly, with
regard to policy issues, we are concerned with the issue of accessibility and
the whole area of intellectual property and that has been discussed somewhat.
Then security is a major issue with us. With clinical data especially, security
is vital. Even the researchers want their data to be secure.
MR.
WEISS: Are you talking about data integrity, privacy, or both?
DR.
DUBOIS: Both, especially with the clinical data.
What
can CODATA do for us? Well, based on what I have heard this morning, we
certainly need standards and to the extent that this group can help us get
these standards, we would welcome that. There is no question about that.
Finally, accessibility has become
quite a problem. We reviewed a resource recently and one of the problems they
had was to look at sudden cardiac death syndrome. They want to look at these
data. One of the good databases for that happens to be in Italy. The Italians,
for some reason or other have good databases. However, their data are not free.
This is the sort of thing we are worried about.
MR.
CHINMAN: When you speak of needing standards, I assume that within your
immediate community that you have a standardized way of labeling things, but
the standards that you are talking about have to do with computer standards or
format standards. Your community needs to get together and define its own set
of standards and then there is --
DR.
DUBOIS: On some level that is true.
MR.
CHINMAN: Then there is the CODATA involvement that could help --
DR.
DUBOIS: There could be -- yes -- sort of a push like what Dr. Vaitukaitis
mentioned earlier, you know, the medical community is not that eager to jump at
standards. Some people look at that as a threat in a sense.
DR.
VAITUKAITIS: The other thing we need is appropriate laboratory tools for access
to high-end technologies, but also laboratory tools for the Internet.
DR.
GAINES: I would make one comment on the visualization of large data sets.
Yesterday, NSF released a new program announcement dealing with exactly that
topic and, of course, we always welcome partners in this. So, if any of the
other agencies have interests in developing advanced techniques for
visualization --
DR.
GERSHON: I might add very quickly that it is too late, too little.
DR.
DUBOIS: We are really aware of the problem and, in fact, we are in the process,
hopefully, of establishing one, possibly two, centers in that area within the
year.
DR.
OERTEL: We’ll now hear from Elliot Siegel of NLM.
Elliot
Siegel, National Library of Medicine
DR.
SIEGEL: Peter and I just had a wonderful conversation. I think we have, I hope,
come to a meeting of the minds that bureaucratic entrepreneurs are not
oxymorons. Some of my best friends, including myself, are motivated by trying
to work smart and making a difference.
MR.
WEISS: With the exception of certain people in European weather services.
DR.
SIEGEL: Let me tell you what I am not going to talk about. I am not going to
talk about health data cards. I am not going to talk about genomic databases
and I am not going to talk about toxicology or environmental health, all those
wonderful things that the entrepreneurial bureaucrats have been working on over
the years.
I
will talk to you about something you probably don't know much about and the issue
was raised yesterday about the danger of there being a greater gap between the
haves and the have nots in terms of accurate information. That is a project
that NLM has been involved with now for almost two years on behalf of the NIH
and it is a project that Dr. Harold Varmus, director of NIH, is very keenly
concerned about on a personal level.
NLM
is working on a Multilateral Initiative on Malaria. I have given you a handout
that is an overview of what NLM is doing (see Appendix F). What NLM is
attempting to do is to work with the fundamental objective of this multilateral
initiative to enhance the capacity of African scientists to do research in
Africa.
There
was a conference held in Dakar, Senegal, in January 1997 that looked at the
whole basic problem of ability to do research in Africa in general, not just on
malaria. What was identified there was the opportunity to work with the malaria
community, both in terms of research and control site because of the tremendous
havoc that this reemerging disease was wreaking on the economy and on human
health. Three million people are estimated to die each year in Africa
alone from malaria. Most of these are women and children.
The
scientists who attended this meeting were very quick to point out that one of
the things that they feel a great lack of is the ability to communicate with
other scientists in Africa and with colleagues around the world, particularly
in Europe and in North America. NLM stepped forward after that, at the
suggestion of Harold Varmus, to play a role in this. We have been very active
in outreach domestically in rural areas in the United States for over ten years
and we are certainly interested in international activities. So, we created a
communications working group within the overall structure. I chair that group
and we play a catalytic role. We play an advisory role and lately we have been
playing a funding role, too, because, frankly, you have to put your money on the
table if you would like to get additional monies. The fundraisers know that and
sometimes if you want to get something done, you have to pay for it.
So,
we advise, we coordinate site visits. We do technical consultations. We fund,
as I say, appropriate communications equipment purchases, installation, and
training. We are very concerned about library infrastructure development. Once
you get access to information, you need an infrastructure that can handle
knowledge management.
We
are also interested in evaluating what we are doing as a model for capacity
building in telematics in developing regions of the world. So, I am speaking
not just to CODATA's interest, but to the NRC that also has a great concern
about the situation in Africa and other developing regions, and specifically
involving the use of telecommunications technology as a means to improve the
situation.
The
bottom line is how to improve communications between scientists. We want to
improve access to needed scientific information, from libraries, electronic as
well as printed media, more databases and all the wonderful things that are on
the Internet.
We
have developed a plan for doing this. We didn't say we are going to solve all
of Africa's problems. We can't do that. We try to do one problem at a time, one
location at a time, one site at a time. The strategy we have undertaken is to
work with malaria research sites where you can expect that there will be
an opportunity for sustainability once the work there has begun. We are looking
for locations where there are what we call champions on the ground, people who
you can work with. You can't orchestrate this from Bethesda and you can't do
this simply with consultants. We need to have people locally who are committed.
So, we seek out those people and those sites. We go there and visit. We go
through the muck and we find what needs to be done and we try to get it done.
We
found that things have changed dramatically over the past two years in terms of
Internet connectivity. Most countries in Africa with the exception of one,
Eritrea, have an Internet gateway, providing full Internet service. The problem
is it sits in the capital city and it doesn't necessarily go to the places you
want it to go and more often than not, it doesn't go to the malaria research
sites.
We
are in the business of providing that last mile connection to the malaria
research site. And this can be done by microwave link. It can be done by cable.
It can be done by satellite. When you are dealing with satellites, you get into
other sorts of problems, which I will address in a moment. But what we are
basically dealing with is where connectivity does exist, it is generally
unreliable. It costs too much and it doesn't provide you with the kinds of
things you need to do if you are going to have access to the modern data
resources we are all familiar with.
We
want to provide access, of course, to MEDLINE and other services that are
available, such as BIOSIS, GenBank, or other genomic databases. We have been
working with Lois Blaine most recently on a malaria research repository, which
she will probably tell you about.
We
have been working with folks who have been doing geographic mapping that track
population movements and the movements of mosquitoes and they use NASA
satellites for this. So, there are a lot of people involved and there are a lot
of organizations involved. That is kind of what I wanted to emphasize because
this is very much a partnership, a multilateral effort, that involves a lot of
different organizations.
Working
with these organizations, the funding organizations are very important because
what I didn't mention—this is critical to the strategy we have adopted—is we
want to inculcate within the research community the need to fund
communications. This is not a problem for NIH; we do this. However, it is a
problem for organizations such as the Institut Pasteur and Wellcome Trust;
these groups ought to fund access to communication resources as a cost of doing
research. It is part of the research enterprise. That is sometimes a hard sell,
but if you can't get that done, you are not going to get sustainability. Part
of what I find myself doing is not just worrying about satellite connections
and how many kilobytes you can transfer, but actually working with these folks
to convince them that this is something they need to pay attention to and they
need to fund.
The
partners, as I have mentioned, have been the African scientists and the
governmental agencies. They have been the donor agencies, including my
colleagues at NIH, Pasteur, Walter Reed, Wellcome Trust, World Bank, and USAID.
I have also brought to the table colleagues who are involved in the
international medical library community in Africa, and also the
telecommunications community.
I
have here a list of the places where we have been and the things we are doing.
I am not going to run through the details of this. You might look at this and
say NLM is acting as the phone company. We are not simply plugging lines into
walls, but you do have to deal with the technology and you have to deal with
the politics. We have learned that in some parts of the world, in Kenya, for
example, we are all set to go. We have done some wonderful work on the ground.
We need to get a piece of paper signed that says we can install a VSAT (very small
aperture terminal) ground station. We have got verbal approval. I don't have it
in writing yet. Until I get it in writing, I can't pay for it. I will give you
some dollar numbers so you know the order of magnitude I am talking about. It
costs basically $25,000 for a ground station, about $25,000 a year to operate,
but compare this to the fact that the phone bill in many of these places is
$30,000 a year. So, it doesn’t cost that much more. We hope that there will be
other technologies that will come along. We have heard about Iridium and
Teledesic. Right now, they are geared primarily to well-heeled business people
and governments. That technology is not there for the kinds of folks we are
talking about working with. We hope that will change.
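The cost comparison in these remarks can be worked through as a rough sketch. It uses only the three figures quoted above and assumes, hypothetically, that a VSAT link would fully replace a site's existing phone bill, which the speaker did not state. The function names are illustrative only.

```python
# Figures quoted in the remarks above (US dollars).
VSAT_INSTALL = 25_000   # one-time cost of a VSAT ground station
VSAT_ANNUAL = 25_000    # yearly cost to operate the ground station
PHONE_ANNUAL = 30_000   # yearly phone bill cited for many sites


def vsat_cumulative(years: int) -> int:
    """Total VSAT spend after the given number of years."""
    return VSAT_INSTALL + VSAT_ANNUAL * years


def phone_cumulative(years: int) -> int:
    """Total phone spend after the given number of years."""
    return PHONE_ANNUAL * years


def break_even_year(max_years: int = 50):
    """First year in which cumulative VSAT spend no longer exceeds
    cumulative phone spend, under the replacement assumption above."""
    for year in range(1, max_years + 1):
        if vsat_cumulative(year) <= phone_cumulative(year):
            return year
    return None


if __name__ == "__main__":
    for year in range(1, 7):
        print(year, vsat_cumulative(year), phone_cumulative(year))
    print("break-even year:", break_even_year())
```

Under that assumption the two totals meet in the fifth year. If the satellite link were added on top of the phone bill rather than replacing it, there would be no break-even point on cost alone.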
We
hope that the NRC can exercise some persuasive power. We are interested in
becoming a part of that effort. In the meantime, we have to deal with what we
have available now. We are working in Kenya, as I mentioned. We have done some
great things in Mali, which has been a recipient of a lot of NIH largesse over
the years. We are ready to go in Tanzania in terms of work there. We need a
funding partner that can pick up the costs of their research and their
communications. We are working in building up library sites.
I
realize I am running out of time. The concept of DOCLINE libraries. That is
probably not a term that is terribly familiar to you but we want to encourage
inter-library loan and we want to build up libraries in Africa. So, we are
working with the South Africans. We are going to be working with a pretty good
library in Harare, Zimbabwe, and we are trying to build up the medical library
in Mali, so that they can serve western Africa, Zimbabwe can serve central and
eastern, and South Africa can serve southern Africa. We envision NLM serving as
a back up, but basically the Africans will be able to help themselves.
We
just got back from a meeting in Durban, South Africa in March. Harold Varmus
and Don Lindberg were there. I was there. We had a nice contingent. Nine
hundred people were there. And we have made some wonderful contacts in terms of
new places to work, in Ghana, Nigeria. We were finalizing plans with the
Medical Research Council in South Africa. The British Medical Association was
on the phone with me two days ago. They want to help. They want to provide
documents.
The
commercial publishers are not very eager to help and that might be something
that you can address. So, what I have laid out for you is basically an open
invitation. This is what we are doing. You are aware of the data needs. There
is a need for the data that we have. The question is how to get them there, how
to deal with technical and political impediments, and how can we get enough
people working together so that this can really make a difference.
DR.
OERTEL: Thank you very much. I would like to move right along to Pamela Andre
from the National Agricultural Library.
Pamela
Andre, National Agricultural Library
DR.
ANDRE: I would like to begin by sending regrets from my colleague, Dr. Judy St.
John. Dr. St. John is responsible for overseeing plant research at USDA's
Agricultural Research Service. Plant genetics is a really big part of that
activity and she was looking forward to talking to you a little bit about what
is going on. Unfortunately, she couldn't. I just want to say that she is very
interested in continuing involvement with this group. I have collected
materials here. I will make sure that she gets copies of all of these.
I
will move on and talk about the library and information as distinct from the
data side. I want to talk specifically about an activity going on at the
Department of Agriculture related to the preservation of USDA digital
publications. As I have listened to all of you around the table, it is very
clear to me that USDA is at the very beginning of many of the activities that
you are very deeply involved in. And as I have heard you talk about the various
issues from technology to copyright to user access, those are the very things
that we are beginning to deal with. And I want to say that by way of warning
that I have already made notes and I would like to invite a number of you folks
to come and talk with us as we move through this process at USDA.
You
may be thinking about what has a publication got to do with data and electronic
data resources. I would suggest that at USDA we are at the very beginning of
dealing with agency policies and procedures relating to both electronic
publications, as well as electronic databases. In taking this approach relating
to digital publications, the expectation is that we have opened a very small
cover on a problem that we believe is, in fact, manageable in that the policies
and procedures coming out of this activity will, in fact, then be broadened to
include the broader database community within the department. So, there is, in
fact, a connection, although it may be some time before we get there.
This
whole initiative really began about two years ago with a conference that we
pulled together here in Washington, relating to the preservation of USDA's
digital publications. That conference was very much driven by users of those
publications, everything from the agricultural research kinds of publications,
to economic statistics and all of the agricultural statistics that really drive
a whole lot of the agribusiness community in this country. People were
concerned that as more and more USDA publications were coming out in electronic
form, what was happening to the old version of those materials.
At
this conference, we talked about the wide range of issues that many of you have
identified here, everything from what exactly is there at USDA that is
published in electronic form, the inventory kinds of issues, through what are
the technologies that we need, what are the archival procedures that will
ensure not just that this material is somewhere in an archive, but that it is
in fact accessible. You have heard a number of folks talk about those issues
around the table. The result of that conference was, in fact, a framework
document that I have here and I would be happy to share with any of you. It is
also available on the Web. It is entitled the "Preservation of and
Permanent Public Access to USDA Digital Publications." I have to
acknowledge Paul Uhlir as our consultant on this activity.
That
activity took place in the spring and summer of 1997. As you may be aware, USDA
is a very large department. There are over a hundred thousand employees and the
idea of getting such a framework document approved within the department was
quite a challenge, as you might imagine. Luckily, we have a chief information
officer (CIO) at USDA who is very well aware of what the department is lacking
with regard to managing this electronic data and information. She took the
initiative to approve this framework document and to establish a steering
committee to move forward to develop the policies and procedures needed within
the department to ensure long-term access to these publications.
I
was given the responsibility of chairing this committee and we now have in
place a committee of about a dozen folks, representing agencies within the
Department of Agriculture, who want to publish materials based on their various
activities. We have a number of stakeholders from the research university
community, from the agribusiness community, as well as some of our key federal
colleagues, such as the Government Printing Office and the archives. We
have come together to begin to grapple with how do we move this forward in a
department that is as large and as diverse as the Department of Agriculture.
Basically we are focusing on three things and, again, it was part of the
outline in the framework document driven by the user community, the inventory
as I mentioned. What are the things that are available in the Department of
Agriculture that need to be retained long term and what is the life cycle
process for managing those things?
I
will tell you that in some of our preliminary discussions with key agencies, they
are very concerned about the version of a publication that is available today and
then they are concerned about the version that is going to be available next
week or next year. They are totally unconcerned with the version that was
available last year. I mean, there does not seem to be an understanding of the
importance of the continuum, the historical information related to progress in
American agriculture. So, I have to say that one of our first activities is
really an educational one. Following that, as many of you have noted, are the
technical requirements: what does it really take to archive and continue to
provide access to these resources once we have identified what they are?
Again,
I have heard a number of you around the table talk about large-scale data sets
and how you are managing them. Just be warned that I am going to call on you to
come and talk to us about that because I think it is very important that folks
at USDA hear from those of you in the federal community who have begun and have
a continuing commitment to these kinds of activities.
The
last item that we are really dealing with relates to user access. What good
does it do to preserve and archive material, if you can't give the folks that
need it access to it?
Those
are basically the key activities that we have underway. Included in this, as
you might expect, is the whole issue of metadata, how do you describe these
resources to facilitate access? I think it is going to be a very long haul. As
I said, education is a key part of this activity, making sure that people in
the department understand and acknowledge the importance of it. I should note
that at this point there is no budget allocated for this activity. So, what we
have is a group of very committed volunteers who think it is important enough
that they will basically take on the work, at least the preliminary inventory
work to try to move it forward. So, we are just beginning, folks, and I am
delighted to hear that so many of you around the table are moving ahead so
quickly and we hope to take advantage of that. Thank you.
DR.
OERTEL: Thank you very much. I am sure we will individually, as well as
collectively, help where we can.
DR.
RUMBLE: Pam, where in the USDA are the people who worry about nutrition
databases, say the nutrition composition of foods, especially on the varietal
level, not just rice, but the 50 different rices and things like that? Does the
National Agricultural Library have something to do with that?
DR.
ANDRE: Not really. There is a Food and Nutrition Service that has that as part
of their mandate. Of course, the Agricultural Research Service is doing research
in the area of nutrition. They are, in fact, developing those kinds of data
sets. But issues relating, as I said, to the long-term archiving and the ongoing
management of those data sets, there is no clear policy within the Department
of Agriculture related to that. Whether it is research that is going on in
forestry, in soil conservation, in nutrition, in plant genetics, and all the
rest of it, there is no departmental policy relating to how that electronic
information needs to be managed.
DR.
RUMBLE: What about international cooperation with respect to the databases, are
you aware of a central point? Does the CIO of USDA worry about that?
DR.
ANDRE: No, the CIO doesn't worry about that. Being a relatively new position in
the Department, the office really worries about its own stability at this
point.
It
sounds like that is not a new concept. But I have to say that USDA, as a
Department, is a very diverse group of agencies. There are 29 agencies within
the Department and traditionally they have all been extraordinarily
independent. Within the Department of Agriculture there are no standards
relating to like kinds of research. So, there are very diverse, very independent
agencies who have difficulty thinking that there are others who are doing
comparable things. Perhaps there should be stronger collaboration.
DR.
OERTEL: Thank you very much. I am curious, all kinds of forms of bureaucratic
entrepreneurship. The next speaker is Joe Bredekamp, from NASA.
Joseph
Bredekamp and Lola Olsen, National Aeronautics and Space Administration
DR.
BREDEKAMP: I am from the Office of Space Science and to just give you some
context we have a very simple mission statement which is to "Solve
Mysteries of the Universe, Explore the Solar System, Discover Planets Around
Other Stars, Search for Life Beyond Earth." Yes, it is a rather awesome
statement of a mission, but actually it is terribly exciting and terribly fun
and right up front, I might say one of the things we are committed to in Space
Science is sharing the excitement of that scientific endeavor, not only amongst
the research community, but with the public at large. I think one of the best
examples of that is where we really want the public to participate in our
missions. Certainly, the Mars Pathfinder Mission is the best example of that.
Those images were made available around the world and a few of them actually
before the Principal Investigator (PI) saw them. He did take a couple of hours’
rest. That is a benchmark that would be tough to duplicate, but that is what we
hope to do with Saturn; literally have the public participate, not just view,
but literally feel a part of.
We
are data rich, data intensive, and I will kind of give you a feel for that, not
so much the data volumes, but just the operating missions (see Appendix G).
This vu-graph is an eye test and the idea is not to go through each of these,
but to give you some sense that it is a terribly exciting time because we have
a very rich set of missions operating, under development, and to be operated,
as we head into the new millennium. There is just a number and diversity of
missions that we have and they range from great observatory missions to small
PI-class missions that are literally operated from universities and private
institutions. It is very international, running from flying instruments on
foreign and international spacecraft to flying their instruments on our
spacecraft to collaborating jointly and operating a wide variety of missions,
small and large. It runs the gamut.
Data management. We really do feel strongly that the reason we