NOTE: This is an unedited verbatim transcript of the Symposium on Electronic Scientific, Technical, and Medical Publishing and its Implications prepared by CASET Associates and is not an official report of The National Academies or of the Committee on Science, Engineering, and Public Policy. Opinions and statements included in the transcript are solely those of the individual persons or participants at the symposium, and are not necessarily adopted or endorsed or verified as accurate by The National Academies.
******
THE NATIONAL ACADEMIES
COMMITTEE ON SCIENCE, ENGINEERING AND PUBLIC POLICY
SYMPOSIUM ON ELECTRONIC SCIENTIFIC, TECHNICAL AND
MEDICAL JOURNAL PUBLISHING AND ITS IMPLICATIONS
May 20, 2003
The National Academies
2100 C Street, N.W.
Washington, D.C.
Proceedings By:
CASET Associates, Ltd.
10201 Lee Highway, Suite 160
Fairfax, Virginia 22030
(703)352-0091
* * *
PROCEEDINGS (8:32 a.m.)
DR. SHORTCLIFFE: In the interest of starting on time, I welcome you to the second day of the symposium.
Before I introduce this morning's first panel, on the chance that some of you have to drift off at the end of the day, I thought I would say a few words right now about the steering committees. I assume I have all of your gratitude for those people here on the staff of the National Academy of Sciences who have played a key role in putting this together, serving to a large extent as members of our committee and knowing this community as well. They are Paul Uhlir, Julie Esanu, and Kevin Rowan. Thank you very much for the staff support for this meeting.
Without further ado, we will start our future-oriented sessions for today, the first one chaired by Dan Atkins. Dan, do you want to introduce your panelists and take it from here? Thanks.
Agenda Item: Panel 4: What is Publishing in the Future?
DR. ATKINS: Good morning. And good morning to those in Web cast-land. I have been spending a couple of months on sabbatical in Berkeley. I am particularly appreciative of any of you who are in California attending this through Web cast this morning.
We have a very distinguished and interesting panel today. Paul Resnick from the University of Michigan, Richard Luce from Los Alamos National Lab, and Hal Abelson, who will help us think about and discuss what is publishing in the future.
There are short bios of the speakers in your packet and I won't repeat what is there. I just want to emphasize a couple of aspects of the work and focus of our panelists that I think are particularly relevant to their presentation today.
Paul Resnick conducts a wide range of research in the general area of enabling productive social relationships by use of information and communication technology. During his spare time here, he was out studying the emergent process of ride sharing in the Virginia suburbs, for example, and thinking I'm sure about ICT applications to that process. He is widely recognized as a pioneer in the area of ICT based recommender and reputation systems, and will focus on that in potential relevance to publishing today.
Richard Luce is the leader of a wide array of initiatives at Los Alamos that fall under the general rubric of the Library Without Walls, and is a pioneer in the Open Archives Initiative.
Hal Abelson is a founding director of both the Free Software Foundation and Creative Commons -- you heard a reference to Creative Commons yesterday -- and primary instigator of both the OpenCourseWare and the DSpace projects at MIT.
Yesterday there were a couple of references to the report of the blue ribbon panel that I chaired over the last year and a half. Here is a URL that you can go to to get PDF versions of the report. If you forget the long URL, and for those of you in Web cast-land, it would be simpler to send me an e-mail, atkins@umich.edu. I also can probably get NSF to send you a glossy cover version, hard print version of the report, if you want that artifact.
To set the stage for today's talk, we are all aware that the digital evolution is disaggregating the traditional processes of many knowledge-intensive activities, but in particular the process of publishing, or scholarly communication, to use a broader term, and is offering up alternatives in both how the various stages of these processes are conducted and who does them. So the functions of repository, metadata creation, credentialling, review or whatever, and long-term stewardship can be separated, disaggregated, and players different from those who have traditionally carried out these tasks can in theory perform them.
It is also changing the process by which knowledge is created, by which discovery takes place. This is most true, and the pioneering efforts in this arena are within the STM field. That is the central theme and the central study.
The creation of this report involved extensive testimony with leaders of essentially the entire spectrum of science communities funded by the NSF. We clearly documented an emergent stretched vision and enhanced aspiration of many science communities in the use of ICT to build quite comprehensive environments based on what is now being called cyberinfrastructure, things that go by names of collaboratories or grid communities, which are functionally complete, in the sense that all of the people, the data, the information, the instruments that one needs to do that particular activity in that particular scientific community of practice are available through the network and online.
So indeed, a growing number of research communities are now creating ICT-based environments or cyberinfrastructure-based environments that link together people, data, information, computational tools, services, and instruments in ways that are functionally complete and that relax constraints of distance and time, where distance can be geographical and/or organizational, cutting across organizational boundaries, and/or multidisciplinary.
There is a general trend we identified towards more interdisciplinary work and broader collaborations in many fields.
This is one of the figures from the report, where we have got the base of storage, computation, and communication that is continuing to accelerate and rocket past at exponential -- or as Jim Duderstadt said yesterday, in some places hyper-exponential -- rates. There is this networking, operating systems, and middleware layer. Then on top of that are the aspirations and recommendations of this report to try to create as common as possible a level of high-performance computational services; data, information, and knowledge management services; and observation, measurement, fabrication, visualization and collaboration services. Then to provide the wherewithal so that this infrastructure layer could be deployed and customized on behalf of specific communities or interdisciplinary communities.
So some implications for publishing -- and there is an alternate executive of the big idea we are trying to get at today in the panel description. I have deliberately chosen another, because we are kind of in the horseless carriage stage of all of this, and the nomenclature is not quite there. But publications can now exist in many intermediate forms. We are moving to a possibility of more of a continuous flow model rather than a discrete batch model.
There were actually phrases used yesterday by various people that are in the vicinity of the idea we are trying to talk about. Wendy Lougee, I believe it was, said we are experiencing a shift from publication as product to publication as process. Jim Duderstadt made a reference to parallel flows.
One of the questioners later in the day came up and said, we are talking about largely today automating better, faster, cheaper what we have done in the past with ICT. We really ought to start thinking about using it to do new things new ways. He pointed out that we are moving into a model of continuous improvement in the digital realm, and also opined that that requires open access.
So raw data, process data, actual replays of experiments and deliberations that are mediated through a collaboratory can actually be captured, replayed, re-experienced, working reports, pre-print manuscripts, credential or branded documents, or you can even imagine things post peer review moving to some kind of hyper ranked state, or actually undergoing the privilege of being annotated by the leaders of the field. Those annotated versions of the documents now become available, and so forth.
The work products can be available at varying times to varying people with varying terms and conditions. So you again have a whole host of customized options that can be available.
Publications need not necessarily be pre-credentialed before publication on the Net. Their use on the Net can be credentialling. This is a way of talking about this that I owe to George Furnas, my colleague at the School of Information. In theory, every encounter with the document may be an opportunity to rank it in some way and create some kind of a cumulative sense of its impact or importance. You could have alternate credentialling entities or methods, and you could pick your favorite.
The best bubbles up in use through the document's social life, in the sense of John Seely Brown's The Social Life of Information. Perhaps open source models could be applied here. That is beyond the scope of what we are going to talk much about today, but I think it is worth pursuing.
The raw ingredients, the data, the computational models, the instruments, the records of deliberation, could be online and accessible by others and conceivably used to validate or reproduce at a deeper level than has traditionally been possible.
Finally, the primary source data can be made available with at least a minimum set of metadata and terms and conditions. Then third parties, particularly in an open access -- and this is partly a part of the Open Archives Initiative -- third parties can then add value by harvesting, enriching, federating, linking, and mining selected content from such collections of open archives.
So in this session, our goal is to try to describe, illuminate and inform discussion about some of these emerging technologies, the related social processes, some specific pilot projects, and the challenges and opportunities that may provide the basis for these kinds of future publishing processes. I put that in quotes, because we may someday not actually think about it explicitly as a publishing process, but as more holistically integrated into the knowledge creation process.
With that then, I'd like to introduce our first speaker, Paul Resnick, from the School of Information at the University of Michigan. Paul?
DR. RESNICK: Thank you, Dan. When people talked about changing -- yesterday what I heard when people thought about changing the current publication process, I heard concerns that things would descend into chaos, that no one would know what documents were worth reading, because you wouldn't have the current peer review process.
I'd like to turn that idea on its head. Instead of going without evaluation, we will have much more -- there is a potential to have much more evaluation than we currently have in the peer review process. We can look at what is happening outside of the scientific publication realm at other things going on on the Internet to give us some clues about where this could go.
Today's publication cycle, there are reputations for publication venues. Certain journals have a better reputation than others, certain academic presses have strong reputations. There are a few designated reviewers for each article, and those reviews gate-keep the publication. It either gets in, or it doesn't get into this prestigious publication; it is a zero or one decision, binary. Then afterwards, we do have citations years later as a behavioral metric of how influential the document was.
If we look out at some other realms on the Internet, I think we can see some trends, and maybe we should see how they apply to scientific publication, where there might be reputation for individual documents, and also for the reviewers. There would be lots of public feedback, both before and after whatever we mark as the official publication time, and we would have lots of behavioral indicators, not just the citation counts.
So let's think about some examples of publicly visible feedback. This is a watch, not unlike the one that I eventually bought, that was available for sale on eBay. It was selling for $48; I actually ended up paying $99 for this one. It was sold by Fastsell 2, which made me a little suspicious, but that number 793 tells you that Fastsell 2 has actually done quite a few sales, and 793 people or a few more than that have left comments, almost all of them quite positive. But if you look down near the bottom, there was the one complaint: I like the watch but the face is scratched. If you are going to sell a used watch, you should specify it.
Some people would say a few of these complaints -- and it really is only a few -- maybe I should go ahead and buy it anyway; others of us who are more concerned or just don't want to deal with the hassle went and bought it retail.
The same kind of thing, not just for evaluating who you can choose to trust on eBay, has been applied to conversations. Here is one conversational site, it is a political deliberation site, and the stories get rated by people. If you are familiar with Slashdot, the same kind of thing happens with comments there. But here, they have arranged these stories based on the expressed interest of previous readers. So the one at the top has an expressed relevance rating of 201. The one at the bottom has a relevance rating of 98. Presumably there are some more further down that have gotten even lower ratings.
Amazon.com, we're getting closer to the publishing world. This I'm sure all of you are familiar with. There are reviews at Amazon, both text reviews and numeric ratings that any reader can put in. Many of us find this quite helpful for books. We don't quite have it for individual articles in scientific publishing yet, but we do have this for books.
Even closer to the scientific publishing world, this is a site called Merlot, which collects teaching resources. They have a whole peer review process. You can see here, I have shown a couple of resources, one called DNA From the Beginning, another called Physlets. For each of these they actually have a peer review process before they will include it in the collection at all. You can see, the peer review is over at the right, but even after it is published, members can put in comments.
Typically, teachers have tried using it, and they say this is what happened in my classroom and so on. The member comments don't always exactly agree with the peer review comments. You can see for the one at the bottom that the average rating for the member comments is a little bit lower than the rating that the peer reviews gave it.
So that is a sense of what is happening with subjective feedback that people can give and make public about things that they are using.
There are also behavioral indicators. You don't ask people what did you think of this; you watch what they do with it. That is like the citation count. Amazon, in addition to those customer reviews, they have the sales rank, in this case, 3,361. I haven't written a book, so I haven't had the pleasure of doing this, but the people I have talked to who have written books find it quite addictive to keep checking back and seeing what their sales rank is.
Here is Netscan. This one is going to require a little bit of explanation. Netscan is a project that Mark Smith at Microsoft Research has been doing to collect behavioral metrics on Usenet newsgroups, the Net news. You've got all kinds of metrics about the groups.
This is one that is organized by author. I have gone to a particular newsgroup here, and I have gotten a list of the authors from that newsgroup in the last 30 days. I picked the third one down and expanded it to see -- if you look at row number three or any of the rows, you see that you get metrics about this user, how many days out of the last 31 were they active in this newsgroup, did they post something; what is the total number of posts that they have done, how many different threads did they touch, how many new threads did they initiate in the conversation, and various other metrics. Then you can go get more specific things about particular threads that they have been involved in, and you can see that in some cases certain authors tend to dominate threads and other authors manage to say a little something in every different thread, and you find out something about the users this way.
Google uses behavioral metrics of links in their PageRank algorithm. This is one that many of you have gone to check, how am I ranked on Google on various search strings. For reputation systems, the stuff I care about is right up there at the top. It turns out if you search on reputation or reputation system, I don't do so well.
But they are not just doing a text match; they are also taking into account how many links are there to this page from other Web pages, and they are even weighting it a little bit by the rank of the other pages that are linking into my pages. So this is a ranking system that takes into account this behavioral metric of who is linking to whom.
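The link-weighting idea described here can be illustrated with a minimal sketch of the published PageRank iteration. This is a simplified textbook version, not Google's production ranking; the page names in `toy_web`, the damping value of 0.85, and the iteration count are assumptions chosen for illustration only.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    Returns a dict mapping page -> rank, with ranks summing to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A link is worth more when the linking page ranks higher.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: two pages with no inbound reputation of
# their own point at "resnick", and a well-linked "survey" does too.
toy_web = {
    "resnick": ["survey"],
    "blog_a": ["resnick"],
    "blog_b": ["resnick"],
    "survey": ["resnick", "blog_a"],
}
ranks = page_rank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:8s} {score:.3f}")
```

In this toy graph the page with the most and best-ranked inbound links comes out on top, and a page nobody links to ends up with only the baseline share.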
This is SSRN. It was mentioned yesterday. This is the download count as a behavioral metric. This is the all-time top ten downloads, 30,000 for the top one, but of course they also have them in lots of different categories, so that more of us have a chance to be a winner.
You are going to hear some more about some behavioral metrics that Richard Luce has been looking at in the next talk.
This is to give you a flavor of some of the things that I think we ought to be thinking about in the credentialing processes for scientific publication in the future. But there are some issues to deal with. Some of these are active areas of research for people who are working on recommender and reputation systems.
An obvious one is gaming the system. You make a bunch of Web pages that all point to yours so that Google will rank yours higher. In fact, there is a whole cottage industry. You can hire consultants who will help you get higher in the Google rankings.
It is a little harder to do this with the Amazon sales rank. It requires you to actually acquire some books. But people do try to figure out, when should I buy the books and should I concentrate it and buy them all at the same time, so that I temporarily get up there and get noticed, or should I spread out. You try to figure out what their scoring metric is and game the system.
You would really want to think about it as you are designing these metrics. The ideal metric would be strategy-proof, so that the optimal behavior for the participants is just to do their normal thing, and they can't easily game the system. It is not always so easy to design the metrics in that way.
Another problem is eliciting early evaluations. In these systems where you have widespread sharing of evaluations, there really is an advantage to going second, let somebody else figure out whether this is a good article to read or not. And of course, if we all try to go second, then there would be no one who goes first.
Another problem can be herding, where the future evaluators don't really reveal what they thought of the document, but they are somehow overly influenced by what the previous evaluators thought. They just go along with the herd.
There are some interesting ideas that potentially would help with the herding problem, for evaluating evaluators, where you might reward evaluators for saying something that goes against the previous wisdom, but which subsequent evaluators agree with. That would be the person who finds the diamond in the rough, would get special rewards; the person who just gives random reviews would get noticed and would get a bad rating.
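One way to make the "diamond in the rough" reward concrete is sketched below. The scoring rule is an assumption introduced for illustration, not a scheme described in the talk: an evaluator earns credit equal to how much closer their score is to the later consensus than the earlier consensus was. The reviewer names and scores are hypothetical.

```python
from statistics import mean


def evaluator_credit(ratings):
    """Score evaluators on one document.

    ratings is a list of (reviewer, score) pairs in submission order.
    Credit is positive when a score is closer to the average of later
    scores than the average of earlier scores was, zero when the
    reviewer simply repeats the earlier average, and negative when the
    score moves away from where later reviewers end up.
    The first and last reviewers are skipped because they lack either
    an earlier baseline or a later consensus.
    """
    credit = {}
    for i, (reviewer, score) in enumerate(ratings):
        earlier = [s for _, s in ratings[:i]]
        later = [s for _, s in ratings[i + 1:]]
        if not earlier or not later:
            continue
        prior_view = mean(earlier)
        later_view = mean(later)
        credit[reviewer] = abs(prior_view - later_view) - abs(score - later_view)
    return credit


# Hypothetical review history: early reviewers score the paper low,
# "cat" breaks from them, and most later reviewers side with "cat".
history = [
    ("ann", 2), ("bob", 2), ("cat", 5), ("dan", 2),
    ("eve", 5), ("fay", 5), ("gus", 4),
]
credit = evaluator_credit(history)
for reviewer, value in credit.items():
    print(f"{reviewer}: {value:+.2f}")
```

Here "cat" is rewarded for breaking with the early low scores, "bob" earns nothing for echoing the first review, and "dan" is penalized for a score that later reviewers did not support. A real scheme would still need to resist the gaming problems raised earlier.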
Also, this is going to require going back and revisiting some of the decisions that we have made about anonymity and accountability in review processes, single blind, double blind, not blind at all. I think we are going to end up for different purposes wanting different versions of that.
I'd like to suggest a few small experiments, and then I will conclude by saying where we might go in the bigger picture. I think for journal Web sites, some of these are more radical than others, but you might publish the reviewer comments. I think those could be of interest. I think they would cause the reviewers to be better, if they knew their comments were going to be published, even without their names attached. You might think about publishing the reviews for the rejected articles as well. I think you would get fewer really bad submissions if people knew that it wasn't free, and that they could potentially be hurting their reputation by having the reviews of that article up there.
Then after publication, we are all running Web sites where people can at least get the abstracts, so how about letting people put comments in that would be publicly visible.
Some other experiments are to try to gather more of these metrics. We are starting to see things like CiteSeer in the computer science area, where they are measuring citations in real time, but also using the link data, using the download data and, in those places where people are actually reading online, getting the reading data, getting how many times this is being assigned in courses, all kinds of behavioral metrics like that.
Start to think about experiments in evaluating the evaluators. My guess is that the best place for this might be in some of the conference proceedings, where you have at least in the ACM world a number of evaluators for an individual article, so you could actually -- I know when I do those reviews, I always go look and see, was my review somewhere in agreement with the other reviewers, or was I a big outlier, and if I was a big outlier, was I right or not. You might actually make that a more explicit thing for evaluating evaluators.
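The outlier check described here can be sketched as a leave-one-out comparison: each reviewer's score is compared with the average of the other reviewers on the same submission. The reviewer labels and scores are hypothetical, and the sketch only measures disagreement; whether the outlier was right still needs a later signal such as citations or reader feedback.

```python
from statistics import mean


def outlier_gaps(scores):
    """Return each reviewer's distance from the mean of the others.

    scores maps reviewer -> numeric score for a single submission.
    """
    gaps = {}
    for reviewer, score in scores.items():
        others = [s for r, s in scores.items() if r != reviewer]
        if others:
            gaps[reviewer] = abs(score - mean(others))
    return gaps


# Hypothetical conference panel with four reviews of one paper.
panel = {"r1": 4, "r2": 4, "r3": 5, "r4": 1}
gaps = outlier_gaps(panel)
biggest_outlier = max(gaps, key=gaps.get)
print(biggest_outlier, round(gaps[biggest_outlier], 2))
```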
The question is, is this going to be the future of scientific publishing. I think we ought to at least consider it as a possibility, that the author would be the only gatekeeper on publishing, that there would be credit for authors based on reader feedback and these behavioral metrics. That credit would really go to individual articles and individual authors. You wouldn't have to only do this through the indirect means of, did they publish in a very high prestige publication. You really might be able to -- when someone comes up for tenure -- do various computations, hopefully better than the ones that are done now of counting the number of publications, and actually get various scores on how influential somebody was.
This last bit is credit for evaluating early, often and well. I have been thinking about -- and this came up yesterday in one of the comments, that the people who are doing reviews are having trouble getting people to do the reviews. I think we are not valuing that sufficiently in the system. Having some metric that didn't just say, yes, you were a reviewer or, yes, you were an associate editor at some journal, but that actually said, this was a really valuable evaluator. We might start thinking about that as a contribution to research, rather than in the service line.
In the promotion and tenure reviews for academics, we always talk teaching, research and service. Serving on editorial boards usually goes under service. But if we think about this knowledge generation and dissemination process in the scientific community, the people who are doing this evaluation and commentary might really be thought of as contributing to the growth of knowledge. If we could get some metrics on how much they are contributing in that way, we might think about that as a research contribution rather than just a service contribution.
Thank you.
DR. ATKINS: Thank you very much, Paul Resnick. Now Richard Luce from Los Alamos National Lab.
DR. LUCE: Good morning. I was given the task by Dan to talk about pre-print servers and their extension to other fields. The first thing that hit me was, pre-prints are something that is a well-known, well-understood concept in the physics community, and in other communities it is met with either a puzzled gaze or some other sort of reaction.
I would like to talk today a little bit about the physics community and pre-prints as kind of a community specific response, where we have been, some enabling infrastructure for where we are today, and then I'll look at a possible tomorrow in terms of recommendation systems for where we may be going.
To start, let me just put out a couple of definitions to keep things clear. Pre-prints clearly has this buyer beware connotation to it in the physics community. It is an informal non-peer review feedback that is weighted very, very differently in the community than a formal refereed report. It is basically the idea of, get something out to colleagues -- one understands the concept of that -- get some feedback, if any may come back, and I'll think about whether or not I want to publish that later.
E-prints on the other hand slowly seem to be accumulating this notion of authors depositing papers, or drafts of papers, in some kind of an archive in order to speed up the communication process, thereby giving authors essentially control over distribution of their work, and saving for downstream the decisions related to formal publication.
So with that distinction, let me just start by going back in time a little bit, back to 1991 with the arXiv, or xxx, that Paul Ginsparg created at Los Alamos.
That archive today has about 28 or so database archives or fields and sub-fields, 244,000 papers. Fundamentally, it has succeeded in large part because Paul is a physicist. He understood well as a high-energy physicist how that community worked, what its needs were. His notion was to take and streamline that communication process relative to pre-prints.
It does give the author control over both distribution and access considerations related to that kind of communication. More recently over the last decade, it has certainly spread into other fields, mathematics, materials, nonlinear sciences, computation, et cetera.
So first of all, let me dispel the notion that this is a phenomenon only in high-energy physics or only in the physics community, and therefore can't work anywhere else. That is actually not the case.
It clearly has increased communication, certainly in the areas that it covers in physics. It is the dominant mode of registering "here is when my idea came out." It may be published six months later, whenever, but you can look back in time and look at that stamp, in terms of the system there.
Cost is very, very low. Paul likes to quote costs that seem to me a little on the low side, but certainly very, very low cost, and consequently wide acceptance.
The driver in the community clearly is speed, how do we make things move faster. It is my belief, from talking with and interacting with a wide variety of both society and commercial publishers, that it was in fact these kinds of examples that get out on the edge that cause the rest of the community to begin to respond and say, perhaps we ought to move in that direction, perhaps we ought to try to hold on to some turf that we have, and maybe we ought to move our model to electronic and so forth.
Clearly we see a trend here in terms of a continuing increase in terms of submissions. I think it is significant to note, in 1995-96, the American Physical Society began to accept pre-print postings, and later began to say we'll make a link back to those. So this was the beginning of a real formalized recognition that there was a role for this, call it bottom layer or first tier, in terms of information, and that we could have a two-tier structure and start to link those things together.
The question is always asked, so is there any value. These things aren't peer reviewed, so is it just a bunch of junk in there. If one does an analysis over a period of time of the quality of the submissions, what you see is a field specific track record in terms of what actually gets published.
High-energy physics theory, about 73 percent of the papers in the archive turn out to be published. In condensed matter, somewhere around a third or so. So it is fairly field specific, but it is an indicator that it is not just things that have no future or have no role in terms of the formal system itself.
What lessons can we learn out of that? The issue of timeliness certainly. A few passionate people can make a difference. This system for a decade while it was at Los Alamos was typically run by on the order of three, four, five people, sometimes doing far too many hours, but very, very passionate. This real sense of, this is really going to change the world. So that small number of people, relatively small amount of dollars, on the order of about a half a million or so in the year, became a very, very dominant thread in terms of the community itself.
Most importantly, I think the lesson is that it addresses sociology of a community of common interests, which is why the system worked for that particular community.
Scholarly communication however is a very complex ecosystem. Clearly, all fields are not the same. The sociology, the behavior, the traditions differ from field to field. Consequently, this solution is not the correct or only solution, nor could it be expected to fit in a variety of other fields.
The one size fits all argument in some cases doesn't apply here. I think the issue or the lesson ought to be, one needs to really understand the community behavior, the traditions, how that community works, and then look for models that meet those kinds of needs and requirements.
There have been spinoffs into other fields. Examples of other e-print systems: Cogprints out of the University of Southampton is certainly quite well known in cognitive science; NCSTRL, as an early effort to get computer science papers harvested together and then start to build a federated collection of computer science technical reports; certainly NDLTD with Ed Fox at Virginia Tech, and the idea of being able to scoop up a set of theses, dissertations and so forth.
The NASA system, the NASA Technical Reports Server, was an early pioneer in terms of trying to bring together a collection of federal reports and make those available, both in terms of metadata and the full text. PubMed Central and E-Biomed, certainly very, very visible and well known in the life sciences community. Living Reviews, a little bit different model out of gravitational physics at one of the Max Planck institutes in Germany. This is the notion of essentially creating a review that gets updated by the author over time. So rather than going back to 1995 and reading a review that is static and wondering what is happening to the field, authors who publish in Living Reviews commit to a process to try to keep the material they put into that online publication up to date.
And we have seen spinoffs in areas like economics and so forth.
If you count the number of Open Archives Initiative-compliant servers, you get around 100 or so people who say we have got some kind of an e-print system, we are going to use a standardized protocol, and we are going to allow people to come in and at least harvest some segment of what we collect here, harvest at least the metadata of that segment.
Unfortunately, we only count about a dozen or so service providers today. If we look at the problem space, it is an enabling infrastructure. Clearly the Open Archives Initiative was not meant to be the end-all, be-all fix to the system. It was really meant to say we need to look for a solution that allows a discipline-specific e-print archive to be able to talk to or communicate with other systems, so that one has the opportunity to go in and look at a pool of things. So we have a variety on the bottom level, a variety of different representations of different systems, and the problem is, how do we get access to this.
The protocol specifies the method by which things can be harvested. We are just reaching the point where we are starting to see people talk about, that is fine, now that we have this data, what kinds of interesting services can we put on top of that. That development has been slower than I thought it might be, but beginning to take off with a variety of different systems.
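As an illustration of the harvesting step described here, the following is a minimal sketch in Python, not any particular archive's implementation. It assumes the OAI-PMH `ListRecords` verb with the `oai_dc` (Dublin Core) metadata format. The base URL, the sample response, and the function names are hypothetical and exist only for the example. No network call is made; the parser runs against an inline sample.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and Dublin Core metadata.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a (hypothetical) data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: harvest only one segment
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        ident = rec.findtext(OAI + "header/" + OAI + "identifier")
        title = rec.findtext(".//" + DC + "title")
        records.append((ident, title))
    return records


# Hypothetical response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical e-print</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

A service provider would repeat requests like this across many data providers and build its services on the pooled metadata.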
One example, Citeseer. As Paul Resnick mentioned, you can see -- this happens to be a submittal on arXiv, and you can see both the citations and what is happening over time. Again, real time. So this begins to hint at the kinds of things that in an open environment people might do with the service level related to an OAI collection.
So what does this mean beyond physics, and what new efforts can we see? I want to note that there has been incredible opposition. I remember going to conferences back in 1992-93. The first thing was, it is limited to the high-energy physics community. The next thing was, it will never work outside of physics. It will never replace online journals. But the system continues to bubble up and bubble up.
There is very powerful opposition coming from very traditional parties who have used the journal publishing either as a cash cow or secondary providers who see their secondary databases as essentially a birthright. I think unfortunately, we saw a lot of political pressure, which created the demise of PubScience because it began to threaten those kinds of interests, and there was a lot of talk about, we need to go after some other targets now. I suspect that may happen or may not happen, but essentially kind of a political track to hold onto the economic value proposition, or the economic position, I should say, that a number of companies have.
MIT DSpace, which I think we will hear more about, the European effort, Figaro, are some examples of models that can either co-exist with the current system or help the current system evolve into something better able to meet the needs of researchers.
I think it is very interesting to me to listen to this dialogue over a period of about a decade, and people talk about this system won't do this and won't do that, and there is very little discussion about, what does it do for the end user, and how do we evolve the system from the perspective of the end user as opposed to all the other players in the value chain.
So sometimes in this complex chain, I think we lose sight of what to me is the most important thing to mention, which is what it is we offer the end user in terms of what we are doing with the system.
If we think about the peer review system, I'm not going to take the position that peer review shouldn't be done. Clearly there is an issue related to how do you -- in the new world, how is something like peer review or how is quality assessed.
In the new world, we have this problem of quality. Rather than having a snapshot that someone takes, and sending that snapshot out for people to take a look at and make some judgment about, what we have instead is a movie stream or a video stream. So we have a very dynamic environment.
One can think about the issue where we have somebody reading something in Physical Review Letters, decides to do an experiment. Out of that experiment comes some simulation code. It is put on a server. That server is hit by a number of other institutions. People in those institutions decide to modify the code, re-run the experiment. Pretty soon you have a chain of people who are interacting with a phenomenon, trying to understand what is going on. At a given point in time, you have a different understanding of what that phenomenon looks like. You also clearly have a set of players who are responsible for in a sense the output of those ideas and how they get communicated. So rather than a snapshot and an article, what you have is this video stream, so it gets very, very difficult to both respond quickly and decide who is it and what is it that we are going to make some decisions about. That is what I am calling compound evolving documents.
So I think the real question that we are struggling with today in part related to the peer review question is, what is influence, and how do we detect influence. I think there is a variety of methods that one needs to look at, essentially a composite. Today we use citations as the sole indicator of influence. That is an author-derived statement about what is important, what has influenced my work.
I want to suggest that we might look at, at least a complementary path, which is the notion of reader behavior related to determining influence. I want to posit the idea that digital libraries or service providers can provide analytical tools to generate new metrics based on user behavior, which complements or may even surpass citation ranking and things like impact factors.
What is the problem with impact factors? To my mind, it is the lazy person's notion of how to figure out what is important in terms of journal ranking. It is very convenient for publishers to say my journal is ranked such-and-such in terms of impact factors. It is relatively easy for a librarian to justify this is why we buy this title instead of that title.
The problem is that the citation is only an indicator of influence. There are many reasons that people might cite a paper. I want to show that I have read the literature in a field. I want to disagree with somebody and prove them wrong. I've got a friend, and I've got to make sure he gets some visibility and enhances his reputation, or there is genuinely a good idea out there that I want to credit.
So impact factors are widely used to rank and evaluate journals. They are often used inappropriately, in my view. Then there is a whole field of bibliometrics, which tries to look at a more complex environment of authors, citations, journal citations and the subjects that are covered. In my view, still a fairly emerging field, but one that we are going to see take up more and more of a drumbeat in this area.
What would a multidimensional model look like, in terms of thinking about something in addition to citations? We have also thought a little bit about the Google model and said, it has some limitations also.
An ideal system might have the following things. You might look at citations and factor that in. You might look at co-citations, determine the nearness or proximity indicator. You might look at the semantics or the content and the meaning of the content in articles and see how they are related.
Finally, you might look at user behavior in terms of traversal paths. By traversal paths, I mean the following. I start off in the morning, I read a report say in the laboratory, a Los Alamos report. From there, I see a link. It refers me to an article in Science. From there I click on something and it takes me to JBC.
Statistically there is probably some relationship then, if those things are done in relatively short periods of time, between the government document I looked at and that JBC article that I have read. If one agrees with that premise and statistically starts to look at how frequently are those kinds of things connected together in a session, in a reader session, and how frequently are those occurring within a community, one starts to see a behavior pattern that I think can suggest things.
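The traversal-path idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system in use at Los Alamos. The log format (reader, timestamp in seconds, document), the 30-minute window, the function name, and the sample events are all hypothetical. For each pair of documents, it counts how many readers accessed both within the window.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(events, window_seconds=1800):
    """Count, per document pair, the readers who accessed both close together.

    events: iterable of (reader_id, timestamp_seconds, document_id).
    Returns a Counter keyed by frozenset({doc_a, doc_b}).
    """
    by_reader = {}
    for reader, ts, doc in events:
        by_reader.setdefault(reader, []).append((ts, doc))

    counts = Counter()
    for visits in by_reader.values():
        visits.sort()  # chronological order within one reader
        pairs_seen = set()  # count each pair at most once per reader
        for (t1, d1), (t2, d2) in combinations(visits, 2):
            if d1 != d2 and t2 - t1 <= window_seconds:
                pairs_seen.add(frozenset((d1, d2)))
        counts.update(pairs_seen)
    return counts


if __name__ == "__main__":
    # Hypothetical access log: two readers follow the report-to-journal path
    # within minutes; a third reads the same items a day apart.
    events = [
        ("r1", 0, "LANL-report"),
        ("r1", 300, "Science-article"),
        ("r1", 600, "JBC-article"),
        ("r2", 100, "LANL-report"),
        ("r2", 400, "JBC-article"),
        ("r3", 0, "LANL-report"),
        ("r3", 90000, "JBC-article"),
    ]
    for pair, n in co_access_counts(events).most_common():
        print(sorted(pair), n)
```

Aggregating such counts within a community gives one possible reader-derived signal to set beside citation counts; how the two would be weighted is left open here.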
So in terms of social navigation, we are currently experimenting with a system that allows us to derive metrics at Los Alamos. We are able to do this because about 95 percent of what we have is electronic only. We can detect and determine community-specific research trends, and we can look at where those trends differ from the ISI impact factors.
So out of this, I believe that we can develop some formal and informal hybrids, looking both at the e-print bottom layer and a higher layer of things that finally get published. We have got the issue of how do you deal with trans-disciplinary science, where things start to collide together, and we don't have a good answer for that today.
Finally, we have got the problem of long term curation. To finish, I want to put that problem in the context of three pieces. You have got whatever it is that is published or out in the literature, and that is the thing that when people talk about preservation, they think about. But secondly, you've got the issue of the relationship, let's say a rich linking environment related to that. So that is the set of things that you would like to preserve and be able to represent over time as well.
But thirdly, you have the whole question of the patterns of behavior related to those things, and that is something you would like to preserve and collect over time, and make available in terms of a curation perspective as well.
I think I am getting the hook, so with that, I thank you.
DR. ATKINS: Our final speaker is Hal Abelson from MIT.
DR. ABELSON: I was sitting yesterday and this morning, listening to Ted Shortliffe and Bruce Alberts and Dan this morning talking about how this panel was supposed to be about the future, this wonderful cyberinfrastructure future. I was reminded of William Gibson, the outstanding cyberpunk writer who 20 years ago gave us the word cyberspace. He said he doesn't like to write about the future, because for most people, the present is already quite terrifying enough.
It is in that spirit that I want to talk about the present. I hope we are all here agreeing that what we are trying to do in this publication process is to promote the progress of science. And of course, what is happening, as was already said this morning is, the elements of that publication are starting potentially to disaggregate. We heard Paul Resnick and Rick Luce talk about some technologies for review, but there are lots and lots of other things where technology can come in and allow different kinds of players. And of course what is happening in this cyberinfrastructure is, we are now all engaged in this cyber videogame called, dis-intermediate thy neighbor.
The main thing that I would like to say is that in this present, the action you should watch for is not new technology, because it is old technology. The action you should watch for is new players coming in and finding institutional reasons related to their other primary missions to participate in this new game of dis-intermediate thy neighbor. In particular, I want to ask the question, do universities have institutional roles to play here other than what they have been so far, which is the place where the authors are. So do universities have a reason in their institutional missions to start participating in this.
If you look at MIT's mission statement, what you will see, and I'm sure lots of other places are like this, that MIT is committed not only to generating, but also to disseminating and to preserving knowledge.
How does a place like MIT or any university think about its mission to disseminate and preserve as well as to generate knowledge? One way is getting pretty famous. About two years ago -- and here you see a statement from MIT's president's report that says MIT has made an institutional commitment to take the primary materials that we use for our students in classes, that we create for our students in classes, and put those up on the Web for free open access by anyone.
The reason we did that is not that we were overcome by some fit of altruism; it is that we decided that given the way the world is going, it would be better for MIT in terms of fulfilling our primary mission to educate our students, if MIT and Stanford and Berkeley and all research universities and all universities put their primary educational material on the Web. It would be a better world for us.
Here is the OpenCourseWare Web site, which you can go to now. It is currently just a prototype and has 50 courses up. There is a group at MIT which is madly trying to get up the first 500 courses by September. They are on a timeline to get up all MIT courses by -- I think it says in the small print there 2007. You can watch and see how we are doing, but the point is, MIT has made this an institutional commitment.
When we started this, people said, sure, there are all sorts of courses that got Web sites up and people can access them. But in this audience, I need hardly emphasize the difference between lots of course Web sites that happen to be up and maintained by faculty members and an institutional publication process, that has committed to that as a permanent activity.
People also mentioned DSpace, which is the sister project of OpenCourseWare for research. DSpace is a pre-publication archive for MIT's research. Again, we just heard Rick talk about pre-publication archives; there are lots and lots of them. The difference with DSpace is that there is an institutional commitment by the MIT libraries, justified in terms of MIT's mission, to maintain that.
OpenCourseWare would make sense if only MIT did it, but DSpace can't possibly be like that, so DSpace, in addition to being a pre-print archive for MIT, is meant to be a federation, which collects together the intellectual output of the world's leading researchers. We have six partners who are working with us.
Again, the importance here is not that there is a piece of software and some database that does it, although I should put in a plug for that. There is a very good pre-print server system that is very robust, it is available on the Web as open source and has already been downloaded by about 2500 places, but the important part is, there is a group of universities who are working out the management and sustainability processes in terms of their institutional commitments for how you would set up a federated archive like this. It is getting a lot of press.
Now, both OpenCourseWare and DSpace are ways that MIT and other universities are asking what should be their institutional role. Do they have an institutional role to play and how can they play it in disseminating and preserving their research output. The question is, why? Why are these questions coming up now? Why might universities start wanting to play institutional roles in the publication process, other than as places where authors happen to be?
The answer is that the increasing tendency to proprietize knowledge, to view the output of research as intellectual property, is hostile to traditional academic values. What are some of the challenges that universities see?
I'm not going to talk about cost because that has already been talked about. Most people here know a lot more about it than I do. But I want to come back and repeat some of the things that have been said yesterday about the arbitrary inconsistent rules that universities are supposed to deal with, the impediments to developing new tools for research, and the risk for monopoly ownership.
So let's review -- Jane Ginsburg and Ann Okerson already did this yesterday -- the basic deal as seen by universities. The basic deal is the authors, the scientists, give their property away to the journals. The journals now own this property and all rights to it forever. Lifetime of author plus 70 years is forever in science. If that regime had been in place 100 years ago, we today would be looking forward to the opportunity in 2007 to get open access to Rutherford's writings on his discovery of the atomic nucleus. This is forever.
Then what happens is, the publishers now take their property and magnanimously grant back to the authors some limited rights that are determined arbitrarily, totally at the discretion of the publisher. The universities, who might think they had something to do with this, generally get no specific rights at all, and the public is not in this discussion at all.
So Jane Ginsburg already showed us some nice examples yesterday, but let me just put them up again. If I give my property to Elsevier, they magnanimously grant me the right to make photocopies of my own articles, or to present my paper at a conference. Thank you, Elsevier. But I shouldn't beat on the commercial publishers.
Yesterday we heard from the American Chemical Society, which were I a chemist would magnanimously grant me the right to give my paper to not more than 50 colleagues -- I suppose chemists aren't as gregarious as computer scientists or something -- and to post not the text of the paper, but the title and abstract on my own Web site.
But of course, both Elsevier and the ACS are amateurs in this game when you compare them to a place like the New England Journal of Medicine, which we heard from yesterday, which grants to authors exactly no rights, bounded only by fair use law in the United States.
Kent Anderson, when I pointed this out to him yesterday, told me that is merely the journal's policy, not its practice. And they are changing their policy, which is something I applaud. Unfortunately, another part of both their policy and practice is the Ingelfinger rule, which says they will not accept any paper that has already appeared in a pre-print archive.
Now, why are universities supposed to accept this deal? Well, because publishing is a serious business. This is a quote from the Nature debate that I absolutely love. Notice, the process either is a stewardship of the journals or unknown individuals. Isn't that great? Unknown individuals. And copyright should not be ceded to individual authors. Where did that copyright come from in the first place? This is an amazing statement.
Why are we doing this? We are promoting the progress of science, and surely quality publication and integrity is important for promoting the progress of science. But there are lots of other things that we could use. There are lots of other things that go into promoting the progress of science.
Paul Resnick already mentioned Google as a research tool. If I would like to know about the HOX-8 gene in zebrafish, I can go to Google and I can type that in, and I can get all sorts of references, unfortunately, not the good ones. Those are locked up behind electronic walls.
I was very encouraged to hear Gordon Tibbitts from Blackwell tell me this morning that they are thinking about a way to allow Google to index peer reviewed literature. But it is not only indexing.
Here is a great little research tool that you can get on the Web. Someone went and made a system that does a concordance of any book in Project Gutenberg. I did some research on Turn of the Screw, which you remember from high school is this wonderful book about evil by Henry James. You can type in the word evil and say, show me -- isn't that amazing? The word evil appears only seven times in Turn of the Screw. If I go to any one of those, I can see the context of it. It is a marvelous research tool. High school term paper heaven. I can get a concordance of Project Gutenberg; I cannot get a concordance of the Communications of the ACM. This is easy technology.
The important part is that this technology was made by somebody who just did it using public tools. It is easy to make a concordance. What is hard is to get across the electronic fences if these things are not available with open access.
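To show how little technology is involved, here is a minimal keyword-in-context concordance in Python. It is a sketch only, not the Project Gutenberg tool mentioned above. The function name, the 30-character context width, and the sample sentence are assumptions made for the example; the sample sentence is made up and is not taken from any book.

```python
import re


def concordance(text, keyword, width=30):
    """Return each whole-word occurrence of keyword with surrounding context.

    Matching is case-insensitive; the matched word is wrapped in brackets.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        line = left + "[" + m.group(0) + "]" + right
        lines.append(line.replace("\n", " "))  # keep each hit on one line
    return lines


if __name__ == "__main__":
    # Made-up sample text; "devilish" must not match the whole word "evil".
    sample = "No evil was named. Evil was only implied, and devilish hints."
    for line in concordance(sample, "evil"):
        print(line)
```

Given open access to full text, the same few lines would work on any collection; the obstacle is getting the text, not writing the tool.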
So the question that I want to ask is, are these tools going to be stillborn because everything is hidden behind fences? Probably not, because the stuff is valuable, publishers know it, people are going to invest in it. The more serious outcome is that the spread of these tools will be done in such a way to stimulate network effects that will further concentrate and monopolize ownership of the scientific literature.
You heard about it yesterday, right? If I make a search engine that talks to only the publications from one publisher, that becomes valuable enough for the publisher then to come in and do what you librarians were talking about yesterday, the big deal. This cycle goes on and on and on.
In case you think I am being paranoid, here is one publisher's view, which you might want to read. We will give scientists desktop access to all the information you need, made available to researchers under licenses according to their institutes. If I were less charitable, I could characterize this as megalomania, but in fact, it is a modest little statement made by a modest little company that in 2001 got ten billion dollars of revenues, of which $3.5 billion was profit.
So the question is, are we headed for a place where the scientific literature is restricted with monopoly ownership, or a system that participates through open standards? Gordon Tibbitts talked yesterday -- and I absolutely applaud this -- about Blackwell's support or recognition for the need for open standards. Let me just mention that one impediment to openness is copyright. You have already heard about Creative Commons. One of the issues about copyright which the panel yesterday didn't mention is the default. If you do nothing when you put something on the Web, that is copyrighted with all rights reserved to you. The assumption that anyone has to make in coming to that is, they can't use it.
It turns out to be surprisingly difficult to -- in the wonderful legal phrase -- abandon your work to the public domain. That is hard to do. It is even more difficult to specify some rights reserved, not all rights reserved.
This is what Creative Commons is about. We started this as an effort to encourage people to allow controlled sharing of their work and to effectively brand this kind of "some rights reserved" on the Internet. There is currently a tiny percentage of the Internet which is using this, something like 250,000 Web pages today, but hopefully that will grow.
Pat Brown showed an example of a Creative Commons -- this is an example of a Creative Commons agreement written in a language for human beings. We also have the same thing in the language written for lawyers. And more importantly, the same thing written in a language for computers, so that a search engine can come around and say, can I make derivative works on this thing.
Let me just finish, because the hook is coming out, and say, the world is disaggregating. There is a big game of dis-intermediation going on. The place to look for action is not new technology. This is old technology. The place to look for the action is new institutional players coming into the game. New technologies might do that for universities. I hope that happens, and I have faith that if that does, that will lead to the promotion of the progress of science.
Thank you.
DR. ATKINS: Thanks to all our speakers. We will open up the floor now for discussion. Before I do that, I want to take not more than five minutes and offer up an opportunity to any of the panelists to add comment or question one of their other panel members. Anyone have something they would like to add at this point?
Let's bring on the questions from the floor and from Web cast-land. Let's begin here on my left. Please state your name.
DR. BERRY: Steve Berry from the University of Chicago. There is one aspect of the refereeing process, Paul Resnick, that I think you have dismissed, overlooked, that could be done in other ways. Reviews by and large have a lot of influence on what actually gets published. It is not a simple rejection process. Of the papers that are reviewed and published, a very high percentage are revised as a result of the reviews.
Furthermore, we have to recognize that the reviewing process provides only a very low threshold. It simply says that this material is of a quality that it is worth scientific discourse. So it is not a judgment of whether it is right or wrong. It is only as I say low threshold.
Now, these could be done in other ways, of course. But I think that we have to recognize that in some fields, reviewing even at that low threshold is looked upon as a very important protection.
I have listened to arguments between physical and biological scientists. Physical scientists are as Rick Luce points out very ready to accept the archive model without review first, and online reader review. The biological scientists are worried that without that low threshold review, things could be put online that would be dangerous to lay users. They feel that there is a large audience for biological and especially biomedical articles that simply don't have the judgment that the professionals have, and that it is simply dangerous.
I'm not sure whether I buy the argument, but it is certainly one that one encounters.
DR. ATKINS: Anyone like to pick up on that?
DR. RESNICK: The first thing is, I just want to make sure I emphasize this. I am reviewing these alternative review mechanisms. I think of them as "and peer review," not "or peer review."
I do think that maybe the peer review -- we could get more out of it than we are getting now, even just the existing peer review, if you made the reviews public. For example, you don't want to let something get out there for the general public to see without anybody from the scientific community saying this is rubbish. Fine, put it out there with the two scientists who said this is rubbish. Why is it better to just hide it than to put it out there with the commentary from the scientific experts?
DR. ATKINS: Any other panelist comments? Let me take a suggestion question from the Web. Referring to the previous talk, and I'm not sure which one precisely that is, might not the thread of the original version of a paper, along with reviewers' comments and authors' revisions or responses to the comment, defense, results, methods, et cetera, as well as the journal editor's additional comments and interpretations, be used as educational material/enrichment for university students? This could be done anonymously.
So it is kind of a suggestion idea. Anyone like to comment on that?
DR. RESNICK: I think it is a great idea. Clearly one of the problems for PhD students now is that they don't ever get to see a paper through its whole process until they do it themselves and learn it the hard way. So an advisor can show them, here are the reviews that have happened to my papers, but not all advisors do that. I think having it a more public process would be helpful for education.
DR. ATKINS: Rick Luce, did you have something to add?
DR. LUCE: The question always comes up, what is the correct balance between noise and some real value in that discussion, and should the system filter it out or should you let the user filter it out. It is debatable as to where you draw the line there.
DR. ATKINS: On my right.
DR. CAMPION: Ted Campion, New England Journal of Medicine. First, just point of fact. As was alluded to, our practice is that all scientific reports are public, free to anybody that has Internet access, and it is all indexed by Google now. We have been getting that done with the help of HighWire.
But my question concerns your discussion of new players in scientific publishing. I think it seems that one big player that we are seeing has barely been mentioned. That is the public press. Scientific publication, particularly in biomedicine, is largely being judged now, at least by authors and I think even academe, by how much press coverage it gets. It is not just studies of estrogen. Zebrafish and hedgehog mutations are getting into the New York Times, are getting into the press.
One way we are being judged now by authors isn't just boring citation indices, but did Peter Jennings cover it. What effect do you see -- and this of course is all being driven by the public's need and increasingly sophisticated view and interest in science and biomedical sciences in particular, but quite broadly, and what effect do you see this having on scientific and biomedical publication?
DR. ATKINS: Hal Abelson, do you want to start?
DR. ABELSON: I am kind of reminded -- I had a discussion with my daughter, who is in medical school, about MIT's OpenCourseWare, and she is saying that would be a terrible thing for medical schools to do, to put their course curriculum on the Web, simply because you had to be a professional medical student in order to evaluate that, and it would be dangerous to have that information out there.
I don't quite know what you do about sensationalism. I think it has gone through so many aspects of our society, and with many -- you could do a lot worse than have Peter Jennings talk about something in the New England Journal. It has been a tradition in the United States that the cure for speech is more speech. Maybe if there are other channels for people to respond, things would be better. But it is very hard to imagine a path that says we should restrict that in some way.
I wanted to ask you, by the way, is the New England Journal thinking of revisiting the Ingelfinger rule?
DR. CAMPION: We publish scientific reports. If we judge that something has been published before, we don't republish it.
DR. ABELSON: So maybe part of the answer is that it is up to the press to worry about the novelty and up to the journals to worry about the authenticity.
DR. ATKINS: Any other comments from the panel on this question? Thank you. The back microphone, on this side.
DR. DOYLE: Mark Doyle from the American Physical Society. As one of Paul Ginsparg's first employees through Rick Luce, I guess I am one of the few passionate people that Rick mentions.
Now I work for the APS. Like MIT, part of our mission is explicitly the advancement and diffusion of the knowledge of physics. So during yesterday's panel, I was surprised when the question was asked of the first panel, what do publishers do that they think they excel at.
Everybody on the panel just said peer review. They didn't really focus on the other big thing that we at the APS think is very important, which is the responsibility to do the archiving and things like that. We have already gone through pretty much the transition, going to fully electronic. The core of our archive now that we consider the primary output is a very richly marked up electronic file from which print and the online material is all derived. That is the thing that we would like to be able to curate.
What I don't see in DSpace or in the e-print archive yet is any effort to build that infrastructure for curating that kind of material. It is one of the most important things. The simple question is, here at the APS we are a well-intentioned scholarly society; why don't we just make our journals an overlay on arXiv or something like that. Effectively, in some ways they are.
I think there are two things. One is, peer review we still think is extremely valuable, and the other is this archiving thing. We seem to be one of the few places that is actually able to curate these things, we being the publishers. An exception would be NLM, which I think we will hear about later, where they have taken this approach to building archives of XML and things like that.
So I was just wondering if people in the panel could comment on that aspect of doing the preservation.
One other thing I would like to comment on is the low cost of arxiv.org. The key problem here is that there is a two to three order of magnitude difference between what it costs the APS or another publisher, that $1500 per article, and what it costs to do dissemination in the archive. That is really where all the tension in the economic models comes from. When you have that large an order of magnitude difference, that is what puts all the pressure on people to change the way that things happen.
That's it.
DR. LUCE: Two threads. The first thread is the archiving question. It is an interesting question. I don't think that libraries generically are going to be able to pull it off. I do think that there are probably a dozen to two dozen libraries globally that would see that as an important role, a significant thing to make an investment in, because they are thinking about centuries, not years or decades. That will move forward.
There are some publishers who are quite aware of the problem, as you are, and making investments. There is a vast majority numerically of publishers who are just simply too small to have the wherewithal, both technologically and financially, to be able to pull it off.
So it is going to take some sort of a hybrid relationship between some publishers and some libraries, a small set of libraries, who see that as their role.
Your second question was I believe related to cost?
DR. DOYLE: It wasn't so much a question as a statement, the fact that there is a two to three order of magnitude difference in the models. But those costs need to be recovered if we are to preserve the two important things that the APS does, which is the peer review component and the archiving component.
Actually, I do have a way to turn that into a question. For that $1500 per article, would libraries be able to actually -- since the large part of that $1500 per article is going into producing that archival XML file, would libraries be able to then take that -- we would be able to give back to the libraries the thing that they paid for, but would they be able to curate it correctly, and is that an incentive for institutions to pay that $1500 per article, rather than authors, if it could feed more richly structured data into their institutional repositories.
DR. LUCE: I haven't looked at how that would scale. The issue that I wanted to respond to related to cost -- and I think Mike Keller touched on this yesterday in his talk -- some very small number of publishers have actually started to rethink and make progress in terms of how production occurs. Therein lies some cost-saving opportunities.
The large majority of publishers that I am familiar with are essentially taking a paper model, electrifying it, it is an analog, and so they are saying we have these enormous costs, without really looking at how do you need to produce this in a different environment. Until they make that flip, it is very difficult to talk about what costs actually ought to be as opposed to what they are.
DR. ATKINS: Hal Abelson, do you have anything to add?
DR. ABELSON: Yes. When we designed the DSpace, it was completely, absolutely essentially deliberate that this was housed in the MIT libraries. The reason is that whether or not the MIT libraries will preserve something for 200 years, they sure as hell will preserve it for 50 years. We wanted to work with an organization that understood what archiving meant, and also what curation meant, because that is what libraries do.
So just to build on what I was saying earlier, the critical thing is not the technology that one can use to put up an archive someplace, because that technology is well under control. The critical thing is to find an institution that will say, as part of our core mission, we are dedicated to keeping this around and preserving it.
If it is part of the core mission of the ACS to be the repository of all chemical literature and have every other organization in the world be your franchisee, that is an important thing to say.
The other thing I wanted to say. I completely agree with you, it is very important not to get trapped into the idea that you are going to be building the monolithic, end-to-end solution. So DSpace will never be that. DSpace might be a place where people building peer review systems and curating systems and all kinds of other systems can link to and build on.
But the trap, and this was alluded to yesterday by Blackwell, is that you don't want to be in a position where people build the whole system. You want to be in a position where there are elements that are communicating through open standards, so lots of people can come in at different places in the value chain and add value in different ways.
DR. ATKINS: We'll go to the side microphone.
DR. MC HENRY: Bruce McHenry. Bruce A. McHenry if you are Googling me. You will find my Web site at discussit.org. Right above the reference to Discussit, you'll find a reference from one of the LCS lab computers that says, Bruce A. McHenry, I am a schizophrenic foosball. If you have been at foosball at MIT, I guess you have a right to be schizophrenic.
Basically, I am working to found a company and raise several million dollars over the next few months to build a layer of protocols on the Web which will allow for annotations associated with credits and debits. From those credits and debits will flow reputation information, which will help to promote or demote pieces.
I think it is going to be part of basic Web infrastructure and an operating system with network effects and lock-in potential, that needs to have a significant discussion about how much of it should be publicly funded and how much should be privately funded.
But initial markets are not going to be academic publishing, so the New England Journal can rest assured that they won't be invading your space right away. The initial markets will be areas where people are paid a lot of money to do what they are doing, which is legal, investment banking, consulting, writing software, and eventually it will trickle down into academe. But academe will be at the leading edge of the rise of the S curve, and so that is why I am here, and that is why I stayed at MIT for graduate school.
I want to pick up on a thread that Hal started with the slide about changing the basic deal. In academe, the basic deal is, you get to have the keys to the universe, know the secrets of how the world works. In exchange for that, you take a vow of poverty, often.
That basic deal does not play very well to mothers around the world. Mine is asking me every time I talk to her, have you made any money yet? It doesn't play particularly well to adolescent girls, where the choice is between Britney Spears and maybe learning science.
So changing the basic deal could start at the top. Paul Resnick mentioned teaching, research and service as the three missions of faculty. I'm not so sure about the way the service mission is being performed, and I'm not so sure about the way publications are being run by boards of peers, selecting articles for submission, for publication.
I think actually, the process of selection is probably inverted, in a sense, because the articles which are of most interest to the audience, the audience knows about, not necessarily the people who are on the review panels. So the selection process should probably start with the audience and then be raised up to the experts in the field for corrections, embellishments and inclusion in the historical record.
So changing the basic deal to me in academe would look something like this. Instead of being awarded research dollars to go off and do research, or maybe in addition to having research dollars to go off and do research, you get substantial amounts of money to give away to others whose research you deem useful and interesting to you. This could apply even to students, that a significant portion of their tuition be given to them as money to be used in the system to buy the work of their peers and also the faculty and other experts in the society.
That changes the whole model to one which is much more monetarily driven, probably. I know there is a great deal of resentment towards any kind of suggestion of that among academics. However, one only has to look at the example of the Soviet Union to see that monetizing things depoliticizes the process of creation, and greatly improves the quality and quantity of the content.
DR. ATKINS: So is there a question there? Paul Resnick has a comment on that comment.
DR. RESNICK: Just the fact that you are starting the company makes me want to ask a question about other institutional players who might be coming in besides universities. Do you see anything else on the horizon? Are there more Googles? Should we be expecting General Motors in this realm? Is it only universities that are the new player here?
DR. ABELSON: I think from the point of view of libraries, the interesting shift is happening at universities. Traditionally, the role of a library at a place like MIT has been, bring in all that stuff from outside to support the research going on in the institution. The interesting shift in something like DSpace is, the library is saying maybe we have another mission or a slightly different mission, which is to effectively be a place where the research from the institution is disseminated.
That is kind of the message that is going on in this DSpace federation, of universities playing with the idea of, does it make sense to view their mission in a very different way.
Now, I haven't actually seen -- I have talked to people in research libraries from large companies, and I haven't seen that as a theme that people adopt, although on the other hand, you can imagine a library at a place like IBM looking at itself effectively as a piece of the marketing arm of a company. If you go to IBM or Cisco, there are tremendous resources that you can get there, but they are not quite seen as a link with the research libraries of that institution.
So I can imagine that sort of thing happening, but again, the point is, there is lots of room for many, many different players to start looking, as I said, not at the whole thing, but pieces of the thing. You can imagine a certification place, you can imagine a place that does professional peer review. These opportunities start coming up.
DR. ATKINS: We are going to start here and then we'll go around.
DR. KING: Thank you, Dan. Donald King from the University of Pittsburgh. Looking at the amount of use, the types of use of articles that are written in science and medicine, about two percent of those articles are used for citation purposes. About 25 percent of those articles are read by academicians, and about 75 percent of those articles are read outside of the academic community.
What I am suggesting is that when you begin these feedback systems, that you think in terms of the enormous value that is derived outside of the academy. There are two purposes for doing this. One is that it is a better metric. It will achieve a better metric for assessing journals and authors and all that kind of stuff, but it also will begin to develop a means of the authors recognizing that their audience is outside of their peers.
I have done a lot of focus group interviews and in-depth interviews of faculty, and they seem to think that they are writing only to the people they know, their immediate community. I think there would be some value in the system if there was some kind of an acknowledgement or recognition that there are other uses of that information outside of the Academy.
Thank you.
DR. ATKINS: Any comments? Go ahead, Paul Resnick.
DR. RESNICK: I think that is good. I think you have a chance to start getting some of this feedback and also usage data from outside the academy.
As you point out, feeding that back to the individual authors, it is not just that you want other people to be able to evaluate whether my stuff gets read; there is actually very little feedback for authors about what is happening with their works.
DR. RHINE: I'm Lennie Rhine, University of Florida. I have no project to promote or position to defend, so I just have a question. How do universities in the tenure process adapt to the open source environment?
DR. ABELSON: Could you say a bit more? I'm not sure what you mean by that.
DR. RHINE: Well, within academia, I think that it is driven a lot by the tenure process. I think that most academics hold on to this as a mechanism to rank hierarchically, to evaluate, et cetera. So if you are going outside this vehicle of peer-reviewed journals which is used so heavily in the tenure process, how do you incorporate this more ephemeral open source information into that process? Does that help?
DR. RESNICK: I'll just tell you what Hal told me before the panel, which is an idea for collecting a lot of these metrics. Right now in the tenure process, they get the journal rankings and they count the number of journals, and different departments weight it differently. They may or may not construct a numeric score, but they all have it in their heads.
You might actually have an open system for computing metrics like that. Think of how U.S. News & World Report does their rankings of schools; they have a particular way of weighting everything. But now, imagine a more open version where we collect all the data, we know what things have been cited, we know what things have been read. We have all the reviews, we have both the behavioral and the subjective feedback. Then you let the teaching institution that has its certain tenure criteria create its own metric based on that, and the research institution will create a different metric, and you can have lots of different ways of using that data.
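As a minimal sketch of the open-metrics idea described above, the Python below assumes a shared pool of per-paper indicators (citations, reads, and a mean review score) and lets two hypothetical institutions apply their own weights to the same data. The field names, the weights, and the sample numbers are all illustrative assumptions, not figures from the symposium.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperIndicators:
    """Openly collected indicators for one paper (hypothetical fields)."""
    citations: int       # behavioral: how often the paper was cited
    reads: int           # behavioral: how often the paper was read
    review_score: float  # subjective: mean reviewer rating, assumed 0-5


def score(papers, weights):
    """Return the weighted sum of indicators over a candidate's papers.

    `weights` maps each indicator name to the importance a given
    institution assigns to it; different institutions pass different maps.
    """
    total = 0.0
    for p in papers:
        total += (weights["citations"] * p.citations
                  + weights["reads"] * p.reads
                  + weights["review_score"] * p.review_score)
    return total


# Made-up sample data for one candidate.
PAPERS = [PaperIndicators(10, 200, 4.0), PaperIndicators(2, 1000, 3.0)]

# Assumed weightings: a research institution counting only citations,
# and a teaching institution emphasizing readership and reviews.
RESEARCH_WEIGHTS = {"citations": 1.0, "reads": 0.0, "review_score": 0.0}
TEACHING_WEIGHTS = {"citations": 0.0, "reads": 0.01, "review_score": 1.0}

research = score(PAPERS, RESEARCH_WEIGHTS)
teaching = score(PAPERS, TEACHING_WEIGHTS)
print("research-weighted score:", research)
print("teaching-weighted score:", teaching)
```

The point of the sketch is only that the data is collected once and kept open, while each institution supplies its own weighting; a real system would need far more indicators and some defense against gaming.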
DR. ABELSON: I think specifically with respect to the tenure process, one of the marvelous things that the NSF did several years ago was to limit the number of papers that one can cite in preparing a grant proposal. It is not really a question of quantity or vast numbers of papers published in third-rate journals.
One of the things that we have been trying to do at MIT is effectively to point out when you come up for a tenure case what are the three things on which I should be evaluated, and to try and get this whole enterprise moving more in terms of what Alan Kay used to call the metric of Sistine Chapel ceilings per century, rather than papers per week.
DR. MOLHOLT: Pat Molholt, Columbia University. I just wanted to offer an example that I think is a hybrid of a hybrid, putting together some ideas from Abelson and Luce, that is active at Columbia at this point.
There is something called CIAO, Columbia International Affairs Online. It is from within the libraries. It is a publishing arm that acts collaboratively with peer institutions to assemble material in sort of a pre-print way. It is technical reports, reports of conferences, other gray literature.
It is then packaged along with other aspects, some news reporting and some more ephemeral material, and it is packaged back out for sale to libraries. So it comes from the library, it is packaged out and sold to libraries, but it also has an element within it that is freely accessible for high school-junior high school teaching that is public and open and can be accessed by anyone. So it picks up on pieces of a number of areas.
That is in international affairs. They are doing one now also in earth sciences, and we are contemplating one in alternative medicine.
DR. ATKINS: Thank you. We'll go to the back microphone, then the front.
DR. BLUME: Marty Blume from APS. First, I'd like to make a cautionary remark about metrics. In fact, I think these have been addressed by Paul Resnick.
I was happy to hear Rick Luce use the word indicators rather than metrics, because there is no one number that can be used as a measure of quality. There are things wrong with all of them, and there was an excellent list of gaming and the like that was put up there.
There is a very nice Dilbert cartoon, which I won't put up for fear of violating their copyright, but it does show the human resources manager saying metrics are very important, a very good one is the rate of employee turnover, and the reply by the manager is, we don't have any turnover, we only hire people who couldn't possibly get work anywhere else.
Many of the metrics suffer from this, and they can be manipulated. You really have to inquire into them and use them as indicators, and it takes a fair number of them if you are going to get a fair measure of quality.
I wanted to also comment on peer review. One of the things that one has to worry about in the case of public comment, although I approve of that, I think it is a good thing to do, nevertheless there is a sort of Gresham's Law of refereeing, in that the bad referees tend to drive out the good ones. All of us who take part in listservs of one sort or another know of the loudmouth who will not be contradicted or denied, and eventually the rest of us give up and say we are not going to take part in this anymore.
So you do have to expect something like this, and you have to have a degree of moderation in it. Lo and behold, what is that but an editor. So this is another piece of the peer review process.
Also, the knowledge that a paper is going to be peer reviewed does have an effect on an author. It means that they try to improve it at first so that it will pass this barrier.
I do have some statistics on peer reviews from our journals. We generally look -- this is a metric; we look at the first 100 articles submitted in a year for one of the journals, and track them through to see what has happened to them in the end. You can imagine these having been put up -- all of them having been put up on the e-print server, and then follow them in that way.
But of the first 100 submitted in one of the years, 61 were accepted and the remaining 39 were rejected or recommended for publication elsewhere. Of the 61 that were accepted, 14 were approved without any revision after one report, 22 were approved after resubmission after some modifications, 14 after a second review and four after the third resubmission, all of these leading to improvements. Some wags say that the improvements for some of them are largely adding references to the referees' papers, but even that is an improvement if there is not enough citation of other work, something that we see now. Of the rejects, 19 were rejected after one report, 13 after two, four after three, two after four and one after six reports. It is much more costly and difficult to reject a paper than to accept it.
But this gives an idea of some of the things and some of the value added in the course of traditional peer review which I know was not under attack in this case. It is something that has to co-exist with the other pieces.
We try to avoid using referees of the type that would lead to the Gresham's Law that we see on the listservs. We are aware of them, and try to select accordingly.
DR. ATKINS: Thank you.
DR. RESNICK: I have one response and then I have one question back to you. I'll ask the question first. Do you know what happened to the 39 that were rejected? Where did they go after that?
DR. BLUME: No, we don't know. We do have some measure of this. We have found that in our letters journal, Physical Review Letters, the acceptance rate is 35 percent. Many of the papers are rejected. It is not an indication of low quality, because there are other criteria, including the breadth of interest of the article, which is rather more subjective. But we have seen in another letters journal -- one of our editors has gone through this, looked at the titles, and found that a full 40 percent of the articles in that journal in the first six months of the year had been rejected from Physical Review Letters. Otherwise we don't track them, and we can't tell.
DR. RESNICK: One of the reasons I ask about that is, just as there is a confirmation bias that we only publish experimental results where the hypothesis was confirmed, we also only let people know when a paper has been accepted; we don't let people know when it has been rejected.
I do want to comment on the Gresham's Law of reviewers. One way that you can evaluate the reviewers is to have an editor or some person in charge who picks them or moderates. But you could come up with some systems where you calibrate reviewers against each other.
DR. BLUME: We actually do this. We keep a private file based on the reports that we receive. The individual will give us a detailed report, a reasoned report. It is something which says publish or reject, which we largely ignore, that sort of thing. And unfortunately, this leads to an over-burdening of the good reviewers, so you are punished for the good work that you do.
DR. RESNICK: No good deed goes unpunished. I would just point out that that kind of system that you are using privately and internally, you could imagine some version of that in more public systems as well. So if you go to a more public system, you don't necessarily have to have all of the lousy reviewers get equal voice.
DR. BLUME: If we do that, we would certainly want to pick people, and would probably continue to do it anonymously. A reviewer is always free to reveal that that individual is the reviewer. We will not.
DR. ATKINS: Front microphone here.
DR. FRIEND: Fred Friend, Joint Information Systems Committee in the U.K. I'd like to ask Hal Abelson how he sees the future relationship, the long term relationship, between institutional repositories and traditional publishers. I can see that repositories have a very valuable role in shaking up the system and in helping us to establish or go back to better priorities in scholarly publication, but what is the role for traditional publishers in the long term in that situation?
Hopefully publishers will respond in a positive way to these changes, and may come out in the end with a better role than they have at the moment, or could institutional repositories take the place of traditional publishers completely?
DR. ABELSON: I think it was Yogi Berra who said it is really hard to make predictions, especially about the future. But I think the main point is that as you have new players, you have different kinds of roles. I don't see any inherent hostility between institutional archives and traditional publishers.
MIT for example has a very respectable journals operation in the MIT press, and we are looking for ways to find joint projects between the press and the DSpace archives. I think what would be very interesting is to see some kind of structural track from pre-prints through certification through authorization process and be able to see some larger piece of that whole process that would be done by a bunch of different institutions coming in collaboratively.
You can imagine just off the top of my head a university holding both the pre-print through the edited version, the journals coming in with some kind of authentication and review cycle. I don't know, I just think there are lots and lots of opportunities. The trick is to free up the system and allow other players in who will provide pieces of that process that the journals for various reasons haven't been.
The danger as I said is, some individual player wants to come in and lock it up in some complete system and says, I own the whole thing. The problem with the World Wide Web has always been that everybody wants to be the spider at the middle of it, and that is the thing that we have to resist.
DR. ATKINS: I think there is a suggestion of an answer to your question in one of Rick Luce's slides, where he was showing that the pre-print servers or the e-print servers or these repositories at the lower layer were providing a platform or an infrastructure on which a whole host of yet to be fully imagined value-added entities could be built, some of them for-profit entities.
So the idea is to create a more open environment for the kind of primary or upstream parts of the value chain, and then to encourage kind of an economy of activity on top of that.
DR. FRIEND: It is just that traditionally, we have looked on publication as being the record. Yet we seem to be saying that long term archiving is not for publishers. So that perhaps rules the record function out for traditional publishers. So what are they left with?
DR. ATKINS: I won't answer that question, but I will just comment that one of the themes that came out loud and clear in the cyberinfrastructure study was the huge latent demand for serious curation of scientific data and mechanisms for the credentialing and validating of scientific data and for mechanisms for encouraging interoperability between data in different fields as people create more comprehensive computational models of ecologies and environments and so forth.
So one of the recommendations in this report is for NSF to step up to the provisioning of leadership in that arena, and of course, at some level there is a lot of synergy between that and some of the long term institutions. In fact, the volume of bits that for example just the high-energy physics community generates probably exceeds even in a year most of the scientific literature worth keeping. So there is a huge scale there.
DR. LUCE: Just to make a quick follow-on point, certainly representing the governmental sector, institutional repositories, is also an issue of deep concern, in terms of being able to have access to things created with public monies, publicly available.
One can easily imagine a system again that takes into account, if that is at the bottom level, things like usage activity, annotation systems, where there are lots of annotations going on. That would suggest that here are some nuggets that perhaps publishers in a different sense at a different stratum might look at, and start to mine that for opportunities to more formalize or put a different value-added spin on that kind of information.
So I don't think these things have to be competitive. In fact, they can exist in a way that is very, very complementary.
DR. ATKINS: Thank you. That was Rick Luce. We'll go to the back here.
DR. KRELLENSTEIN: I'm Mark Krellenstein from Elsevier. I thought people might want a live target up here that they could talk to, so I am offering myself.
First a correction. The profit numbers, revenue numbers, that you quoted weren't correct. Elsevier certainly is a large successful company that is very profitable, but the actual revenue number is a little bit less than the number you talked about. Elsevier's part of that is less than a quarter.
Secondly, you talked about Derk Haank's comment about producing a universal search engine for licensed and proprietary content. That really came in response to our users and to the libraries who have told us what their users want. What we hear is that people want the answer to Google for licensed content. They really want as close to a universal search engine for proprietary materials as they have with Google for the non-proprietary.
The idea of multiple players in the value chain, which I think appeals to expert researchers such as me or to you, is less appealing often to undergraduates in particular, who really want, to the extent that it is possible, a single solution for all their search needs. We don't expect it to be really a single solution, but as much licensed content, ours and other publishers' also, that it is really an open platform that we are selling, not a licensed set of data.
As much as we can put together, we are trying to do that. No differently than Google. We are charging for it because of the model we have for charging for content. Google supports itself via advertising. That could be an option perhaps for a company like Elsevier, but probably not in the scientific space to make the kind of revenue that Google makes, or to make the kind of revenue that is necessary, given other models discussed here. That is a separate discussion, about how we pay for that.
The other point I wanted to make is that we are open also to other players doing the same kind of thing. There is a metasearch initiative going on right now in NISO, which again is trying to respond to libraries' request to have a small number of services for proprietary content, rather than that long list of 60 providers. You go to the library and you see all the electronic services that are available.
So we are working there together with NISO and other large publishers who are here to develop open standards, so that any metasearch company could come in and metasearch these proprietary services.
Finally, we offer a service called Scirus right now, which does provide access to some of that hidden content that Google does not provide access to. It is a free service, www.scirus.com. It has scientific papers from the Web. We only cull the scientific part of the Web. It has got 150 million Web documents. It has all the Elsevier proprietary content. It has other publisher proprietary content, whatever we can license from other publishers. That content is available for a fee. The abstracts are free, but if you click through to that content, if your site is licensed, you go right through to the full text, and if not, there is a pay per view model, at least in the case of Elsevier.
I guess my question for Professor Abelson is whether you don't see that need also for more consolidation, at least for undergraduates in certain kinds of research at the same time that you want this open federated capability and the ability for many, many players to come in.
There is a simplification. What Google has in fact done is, it has created success, where something equivalent is now desirable to some extent, not completely, but to some extent on the licensed content side.
DR. ATKINS: Any followup? We have ten more minutes left in this session. If there are any of you listening through Web cast that would like to send in questions, please do so quickly. Don King?
DR. KING: Thank you, Dan. Donald King again from University of Pittsburgh. I really am an introvert.
There is another dimension that I think you need to consider when you are assessing the materials that are published. That is the dimension of time following the publication of the articles, or the availability of them in the pre-print archives.
The median age of a citation is something like six to eight years, depending on the field of science. Most of the reading that takes place, about 60 percent of it, takes place within the first year of publication, but almost all of that is for the purpose of keeping up with the literature, knowing what your peers are doing, and things of this kind. Something like 40 percent of that information is already known by the reader.
As the age of the material gets older, the usefulness and the value of that material gets older. But about ten percent of the articles that are read are over 15 years old, and for the most part, particularly in industry, that is particularly useful information.
One of the reasons for that is that science in industry oftentimes reflects the needs of the organization being served. In the academy, you tend to follow a line of research over a long period of time, but in industry you are assigned a new area that has to be followed. Oftentimes that requires a need to go back into the literature much more thoroughly than you have.
All I'm saying is that there are some things that you can begin to get feedback in that will help make that better than it has been in the past.
DR. ATKINS: Thank you. This is the last call for comments or questions.
DR. SMOLENS: Michael Smolens, a company called 3 Billion Books. Just to comment, I haven't heard a lot over the last day and a half about a term that I will call cultural diversity. There are a lot of different cultures and language groups in the world that have a lot to say about a lot of the issues that are being discussed here.
I just want to make people aware of an organization that started in 1998 called the INCD. It is the International Network for Cultural Diversity. It was started by someone in Sweden who got 30,000 artists, writers, musicians together because they couldn't deal with the European Union in their own language of Swedish; they had to deal with it for patent and litigation and other issues.
Their goal is to try to have the issue of very small cultures and language groups be heard at international meetings and consortia, so that when the World Trade Organization is dealing with trade issues, someone there is at least thinking about the fact that language groupings are disappearing very rapidly and culture should be maintained.
So I just would like to point out that the cultural diversity issue around the world is a very, very sensitive issue that I think everyone needs to keep on the top of their minds when they are discussing a lot of these issues.
DR. ATKINS: Thank you for that comment. Anything else? I declare this session adjourned. Thank you very much.
(Brief recess.)
(ADMINISTRATIVE REMARKS)
Agenda Item: Panel 5: What Constitutes a Publication in the Digital Environment?
DR. LYNCH: Good morning. I am Cliff Lynch from the Coalition for Networked Information. I will be moderating this session. For those of you joining us through the Web, welcome back, whatever time zone you may be in. Let me remind those participating through the Web cast that they can submit questions which we will take later through the National Academies Web page.
This is the fifth and final topical session before our wrap-up session this afternoon. This session was designed as part of a pair with the earlier session that Dan Atkins moderated this morning, which I think almost all of you were at.
In this session, we want to start taking a look at the issue of what constitutes a publication in this digital world that is evolving, with specific attention of course to the frameworks of science and of journals.
I do want to say parenthetically that while we are going to focus on science here, there is lots of action in scholarship broadly beyond science, engineering, medicine and technology. For example, the humanities have been very active and very creative in exploring the use of the new media. I refer you among other sources to a colloquium that was jointly sponsored by the National Research Council, the Coalition for Network Cultural Heritage, my organization, and others in January of this year, to take a look at some of what the humanities are doing.
As we look at this question of what constitutes a publication, how the character of publications changes, I am hoping we can switch our focus from the environmental questions that Dan's panel was talking about, which dealt with publication as process, with the flow, with how we select, how we disseminate, to questions about how we author and how we bind together pieces of authorship into structures like journals.
If you think about it, I believe we can approach this from two kinds of perspectives. One is to recognize that -- is to take it from the individual author's point of view, to recognize that as has been well documented for example in the cyberinfrastructure report you heard about earlier, the practice of science is changing. It is becoming much more data intensive. Simulations are becoming a more important part of some scientific practices. We are seeing the development of community databases that structure and disseminate knowledge alongside of traditional publishing.
In that kind of a world, you can ask questions about how do people author articles, how should they author articles. It is clear that articles are or can be in the digital environment more than just paper by digital means. It is a sad fact, if you look at most of the journals on the Net today, they really are paper by digital means. In fact, they are often printed for serious reading and engagement. We still are using all of this technology around an authorship model that is strongly rooted in paper.
We can add multi-media. There are trivial extensions, but there are less trivial extensions, too. How we structure our arguments, recognizing that as Hal Abelson hinted earlier, not all of our readers are going to be human. Programs are going to read the things we write, and programs aren't very bright sometimes, and you have to write differently for them. So that is one perspective we can look at.
The other is the perspective of the journal, of the aggregation of these articles, about recognizing that the ecology in which journals exist has changed radically. It used to be that the other things in that ecology were other print journals and secondary abstract and indexing sorts of databases. Now it has become very complicated. There are all kinds of data repositories. There are live linkages among journals. There is an interweaving of data and authored argument which is becoming very complex.
So these are the kinds of questions that I hope we will have an opportunity to engage in our session this morning.
We have three speakers on our panel. Each will speak for about 15 minutes or so as we have done in the other panels. After all three of our speakers have been heard from, we will open it up to questions, and I will moderate.
Let me very briefly introduce our speakers in the order that they will be presenting. You can find longer biographies in the packet that you've got, so I will be very brief here.
Our first speaker is Monica Bradford. She is the Executive Editor of Science. Our second speaker is Alex Szalay. He is the Alumni Centennial Professor at the Department of Physics and Astronomy at Johns Hopkins. Our third and final speaker will be David Lipman. He is the Director of the National Center for Biotechnology Information at the National Institutes of Health.
I will invite Monica to come and give us the first talk.
MS. BRADFORD: Thank you, Cliff. It is a pleasure to be here this morning with you all and to tell you a little bit about Science's STKE, which stands for Signal Transduction Knowledge Environment. I am going to tell you a little bit about the history of the publication, a little bit about the current status, talk about some of the specific issues related to defining what is a publication in the digital environment, and then if time permits, talk a little bit about how we have tried to use the power of a more traditional publication that you may be aware of known as Science to help move forward this slightly less traditional project.
This was started in 1997. At the time we thought it was a bold experiment, even though we were a bunch of old players, I just learned, but we were feeling bold at the time.
It is a joint project between AAAS and Stanford University Library and at the time, also Island Press. The reason the three groups came together is, Stanford University Library was very interested in making sure that the nonprofit publishers and the smaller publishers were able to be players online as we moved into the digital environment. They also had started up HighWire Press. We had at that time put Science online, and we were excited about the possibility of working with them on new technology ideas.
Island Press is a small environmental publisher, primarily books, but they had ties to the Pew Charitable Trust, who was interested in funding some kind of experiment online. Island Press was helping to determine what the right area might be for that experiment. They were particularly interested in the intersection of science and policy.
Then of course, AAAS -- we had been so excited about putting Science online and all the potential that we felt the online environment offered to us, that we were eager to try something new. Floyd Bloom had helped us get up Science, and it seemed like such an easy thing to do. So what was next? We could do everything, we thought.
Specifically the goals of the knowledge environment were to move beyond the full text journal online and to provide researchers with an online environment that linked all the different kinds of information they use, not just their journals, but link it together so that they could move more easily and decrease the time that was required for gathering information, giving them much more time for valuable research and increasing therefore their productivity.
Why signal transduction? That was the first area we picked. The reason was primarily that our funders wanted us to try to find an area that hopefully at some point could become self sustaining. So therefore, science at the intersection with policy was quickly eliminated, particularly because so much of the literature is actually what we refer to as gray literature. It was not digitized, it wasn't clear how you would get there, so we moved to an area where we at AAAS and Science in particular were very comfortable, and that was signal transduction.
As you can see, it is very interdisciplinary within the life sciences, biological sciences. You have cell biologists, molecular biologists, developmental biologists, neuroscientists, structural biologists, immunologists, microbiologists, all of them at some point in time need to know something. They come to a point in their research where they need to know something about signal transduction.
Also, there were some business reasons, though I am not talking about cost or revenue, the kind of things the publisher typically looks for. We thought there was a broad potential user base. Both industry and academia was very interested in this topic. It didn't have one primary journal at the time; the information was spread across a lot of journals. There was no major society. But more important, there were some things about the study and the research and the kind of information that were most important to this, and for our reasons of wanting to pursue it.
The area of signal transduction is very complex and the information is widely distributed. We felt it was important to be able to create links between these discrete pieces of information if we were going to push knowledge forward. We felt that there was a potential here by making these links to have substantial gains in practical and basic understanding of biological regulatory systems. We felt that it was an ideal place for AAAS to begin such an experiment, because it would reach across disciplines, and after all, that is what AAAS is all about; we are interdisciplinary.
To be honest, the information in signalling had outgrown what you could do in print. It really called out for a new way of communicating.
One thing that we were a little surprised to find out with it, not only did we have to answer the question why signal transduction, but for business reasons we had to answer what is signal transduction. While to our researchers in our community we were serving it was very clear what we meant, as we tried to move this into the business world, we had to explain what signal transduction is, why should a library care about a knowledge environment around this topic. We explained to them that a lot of their different researchers and a lot of their different schools and departments would be interested in this. So there has been a bit of education that had to go along with the process.
Where did we go? This is what the signal transduction knowledge environment looks like today. When you look at it now, it looks like some modest steps, but I can tell you that over the time it has been seen not to be modest steps. It has been a hard process. It is the first of the knowledge environments hosted at HighWire Press; I think now there are five.
When you look at this, we now consider STKE as part of the suite of products that we refer to as Science Online. I also wanted to say that one of the next areas we are moving into is the biology of aging, because it has some of the similar characteristics that I mentioned before, about being interdisciplinary and not having a home base and a need for scientists to talk across fields.
On a more personal note, after these six years with STKE, I want to know as much as I can about the biology of aging, because believe me, we have aged.
But as you can see at the top there, we have some pretty traditional categories for content. You have your typical literature. We tried to create community, so there is an area with tools that relate to community, and resources that scientists would use, and then the most interesting part and the part I'll spend most of my discussion on, which is the connections map. On the left-hand side are pretty much your tools and your services and your top navigation is really your content navigation.
Let's take a look at this. On the macro level, what is a publication? We are used to our Science magazine, we're used to what it looked like when it went online, and it has now supplemental material. So parts of STKE are very familiar; they look like what you would think of as a publication. But it is interesting, in that it is really a collection of a lot of items that used to be considered each as a separate publication.
We have connected them all together in this environment, and we consider this the publication. It has its own ISSN. It is indexed by Medline. The updated reviews, the protocols, the perspectives, they all have their own DOIs. So in many ways, that part of it is very familiar and very similar.
But then we combined it with these other aspects to make a larger publication. I think this is what you will see in the future. We will move away from just putting the journal online and trying to pull it together.
The virtual journal is interesting, in that it is full text access to signal transduction related articles in 45 different journals. Some of whom are represented here, wonderful people, who are willing to go into this experiment with us. We refer to them as the participating publishers.
Their content when it goes online at HighWire, we have an algorithm that runs across the content and selects out for that content which is related to signalling. Subscribers to this knowledge environment can access that content. I think later on in the question and answer segment, it might be interesting to hear from some of them as to the pros and cons of this. One, it is a marketing tool and two is whether or not there is concern about what it does to their own content.
Then community is letters, discussions, forums, directories. To be honest, that has been the hardest part to develop. We thought that that would be maybe the most exciting or interesting, that people would have this place to talk across fields and talk to each other and with scientists that they don't normally see at their meetings, but in fact, that has been the hardest part to really get going.
Learning services. This Week in Signal Transduction is written by editors. It is reviewing the current literature that comes out and highlighting what is important. We have things like you see other places, custom electronic TOCs and personalization, which is another part that we hope makes the knowledge environment really useful, the ability to put things in your own folders, to save your searches, to relate resources. We have the function to do that as an editor or at the level of an individual user. Then you can also set your display settings either to display only things since the last time you were there, or content from your selection of journals.
So I think in that sense, in the macro level, it has been accepted as a publication. We have 45,000 unique computers visiting a month. About a quarter of those come back more than once a month; 30,000 requests for full texts, original articles, 5,000 PDF reprint downloads, 10,000 connection map pages viewed, and at the bottom, a little bit about the efforts to become self sustaining.
I guess the part I would like to look at now is, within this larger publication that is a little bit more traditional with its ISSN, are there parts of it that in and of themselves are new and a new kind of publishing and a new kind of publication. So I am going to just focus on the connections map.
It is basically a Sybase relational database of information about these signalling pathways. Each entry within the database is created by an authority. The authorities are solicited by the editors at STKE. On top of this database there is a graphical user interface which you can see a piece of it there on the right. In addition, in order to create this database, HighWire worked with us to create what we call CMADES, which is a software that is downloaded for the authority to work on to enter data into the system. As time goes on, we think the real value of this will be adding bioinformatic tools on top of this database to even further find new and interesting information.
This just gives you an idea of what the display looks like. You can see each of those circles are a component. You can click on those components, and if you click on any one of these components, you will go to its database record. It tells you what it is. You can use this to move around.
The pluses and minuses which don't line up too well, we have some issues about the display, those represent the information about the relationship about the different parts. Up here at the top it shows you the different colors so you can identify what part of the cell this component -- where in the cell it resides.
The connections map is pathway-centric versus some efforts that are being done that are molecule-centric. We found that we think that in the long run, this is going to be an interesting way of synthesizing information. It is going to have a lot of use.
We think that the authorities that work on these are willing to work with this because we had existing relationships with them as authors. They trusted us to put out quality content, and I think they trust us that we are going to try to make sure that this kind of effort is recognized, and we have a reputation for being reliable in this sense. So if they are going to put this kind of effort into creating something, it is not just going to go away tomorrow. I think right now in this time of experimenting, that is really important. If a scientist is going to do something like this, they want to make sure that it lasts.
This just gives you an idea of the way the connections are organized. This is what it would look like when you click on a component. It tells you the name of the component, the scope, what kind of pathway it is in, abbreviations, types, subset or localization, upstream, downstream components. If you look, it tells you who the authority is right there, and there is a link so you can contact the authority that is responsible for the information that is being displayed here for the entry of that information.
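To make the structure of such a record concrete, here is a minimal sketch of how component records and their signed upstream/downstream relationships might sit in a relational database. It uses Python's built-in sqlite3 rather than Sybase, and every table name, column name, and sample record is a hypothetical stand-in loosely modelled on the fields listed above; it is not the actual STKE schema.

```python
import sqlite3

# Hypothetical schema: names are illustrative only, loosely following
# the fields read out in the talk (name, type, localization, authority).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE component (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    abbreviation TEXT,
    type         TEXT,
    localization TEXT,   -- where in the cell the component resides
    authority    TEXT    -- person responsible for the entry
);
CREATE TABLE relation (
    upstream_id   INTEGER REFERENCES component(id),
    downstream_id INTEGER REFERENCES component(id),
    effect        TEXT CHECK (effect IN ('+', '-'))  -- plus/minus shown on the map
);
""")

# Made-up placeholder records, not real pathway data.
conn.executemany(
    "INSERT INTO component VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Receptor A", "RA", "receptor", "plasma membrane", "Authority X"),
        (2, "Kinase B", "KB", "kinase", "cytosol", "Authority X"),
        (3, "Factor C", "FC", "transcription factor", "nucleus", "Authority X"),
    ],
)
conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", [(1, 2, "+"), (2, 3, "-")])


def downstream_of(name):
    """Return (component name, effect) pairs directly downstream of `name`."""
    rows = conn.execute(
        """
        SELECT d.name, r.effect
        FROM relation r
        JOIN component u ON u.id = r.upstream_id
        JOIN component d ON d.id = r.downstream_id
        WHERE u.name = ?
        ORDER BY d.name
        """,
        (name,),
    )
    return rows.fetchall()


print(downstream_of("Receptor A"))
```

Keeping the relationships in their own table, rather than as fields on the component, is one way to let the same component take part in several pathways.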
So at this point, the question is, is this effort moving beyond what would just be called a database entry to a true publication? For each of these pathways, the authors are supplying abstracts, and metadata is being created at the pathway level. We are hopefully going to try to submit that to PubMed and see what happens when we send the rest of our information.
Each pathway has an identified author, which is an authority, and they can work in groups. They bring in other people, because then these pathways are so complex, you need more than one authority. They synthesize and vet vast and sometimes conflicting literature. We do edit the data in the database before it is released and approved, though this process is far from perfect, and it is more difficult than we thought it would be.
These pathways are reviewed at a specific point in time. Each year we are going to have one issue of Science that focuses on the pathways. Right before that point, we try to review each pathway that is going to be highlighted in that issue.
The reviewing process is very hard, because you have to go through a lot of information. It is all networked. One of the things that we need to develop are tools for the reviewers, how to be able to know systematically what you have looked at, how it is connected, at what point you saw it, and have some sort of printed output for them to look at, because navigating through the database to review we are finding is difficult, and we are looking for new ideas of how to do that.
These can be updated as needed. At the bottom of each component information, you will see a date and time that it was updated. Then as I said, the Viewpoint in Science provides a snapshot of the state of the knowledge in this pathway at the time of review. But they continually can be reviewed if there are significant changes to the pathway, or changes in the authorities that are involved.
Just quickly, I wanted to tell you a little bit about CMADES, which is this authority tool. We felt that the way this was going to succeed was to get topnotch scientists who were recognized experts, so we had to give them tools that made this activity easier and more productive. That is what CMADES is.
Also, we have found some of the pathway authorities say it helps them organize their own information. So in other words, it is a tool that they feel they are getting benefit at, beyond just populating the database.
There are fields for descriptive text, controlled vocabularies. Some fields are required, others are optional. The tool includes an ability for automated citation searching, and then you can bring the citations into the database. So for each component, you would have a list like the journal articles or the things you should read to understand more about this particular component.
To give you an idea of what has been established in the literature, they have a graphical tool for organizing the pathways, and also an ability to link out to nucleotide and protein database searches, and bring those links into the information.
I'm just going to show you really quickly, this gives you an idea -- this is a jumbled slide, but at this point, you can see where they have texts that can describe it. This gives you the different components. This is a controlled vocabulary. Under here you can see the drawing tool, and these are a whole list of components that when you close this, you would click on.
I'm out of time, so I'm going to go quick. Basically what we are trying to do here is create new knowledge and look for ways to explore the network properties of the signalling systems that you can't get from the print product, and all these little discrete pieces. You need to look for inter-pathway connections. You need to query the pathways and look for networking properties, and hopefully use this tool for modelling and integration with other bioinformatic resources.
There are a lot of things that we still need to work on. We are hoping to go out for another round of funding to do some of these things. The scientists that we are working with are saying this is where they see the future. Taking this kind of information and adding these things on top of it will make it easier for them to discover new knowledge and to be more productive. That is our goal, that the information is leading to productivity.
Lessons learned. We think that the definition of a publication is evolving. The researchers we are working with, they have a vision. They understand how this information can work for them, and they are really excited. That is the best part of this. That is what makes it fun, because they tell you what they need. You try to create it, you have to work on it, you have to keep changing and evolving, but you can see the excitement and the people beginning to use it, and it is making a difference.
Efforts at standardizing data input and controlled vocabularies have been really difficult. NIH tried to help with this. I don't think you are seeing the standards evolving. I think someone is just going to have to get out there and start doing it, and from there, the standards will start to take hold.
The reward system is not there yet for those people who are doing this kind of authoring. For that reason, we have felt that we have had to link that to the traditional mode of publication during this transition, so those people do get seen.
The Viewpoints in Science do get cited. They are in PubMed. They draw attention to the site. Hopefully over time we will see the pathways themselves cited more than just by Science.
I'll stop there.
DR. LYNCH: Thank you, Monica Bradford. Let me invite Alex Szalay now, who will give us a viewpoint from a different science.
DR. SZALAY: This is some of the data that we are currently working with, the positions of 130,000 galaxies in the sky. This indicates that we are dealing with fairly complex data which has both spatial and temporal aspects. This is of course only a small fraction of the sky, so eventually we are going to fill in a whole sphere and to study the spatial patterns of these distributions.
This is just a little application that was written by my son. I am very proud. I would like to talk about why publishing large data sets is an issue. It is work that we have been doing together with Jim Gray from Microsoft Research, who is one of the world's database experts. We have been working together for the last five years.
The bottom line is that we are living in an exponential world. I think this holds not just for astronomers, but for scientists in general. Currently we have a few hundred terabytes, and if we want to cover the sky from the atmospheric resolutions as well as through the atmosphere, to cover the sky in one color, with one waveband, it would be about four terabytes.
But of course, we are trying to do this in multiple wavelengths, so from the ultraviolet to X-rays and then radio, so essentially this adds up to a few hundred terabytes. But this is what we have today. Very soon we are also going to start projects which are going to re-observe the sky every four nights, to look for variable objects, to look for the temporal dimension. At that point, we are talking about the data of a few petabytes per year.
The astronomers are looking for new kinds of objects for interesting ones which are the outliers from the typical. We are also trying to study the properties of the typical objects, so for every object we can detect on the sky we derive 400 different attributes, and we study this.
The bottom line is, the data doubles every year. Why? This chart, the blue line shows the square meter of telescope area available to us for the last 30 years, and the green line shows the number of pixels available in detectors on these telescopes. One can see that the growth in the detectors is a factor of several thousand where there is only a factor of 30 growth in the telescopes. These are the things that cost hundreds of millions of dollars each, these telescopes, and it is actually the little devices which drive astronomy.
The data is doubling roughly every year or every one and a half years. Essentially it is growing, and it is not an accident. Astronomy as all sciences operates under a flat budget, so we basically spend as much money as we get from our funding sources, and essentially use this money to buy computer equipment, to get more data. The limiting factor is that we drive our computers as hard as we can, so this is why our data is growing with Moore's law. That is I think a commonality across all sciences, so we will probably see that the same trend is appearing in other areas.
The data in astronomy typically becomes public after about one year, so there is a one-year proprietary period for the people who have taken it, who build the instruments and so on. But then after that it is released. Some people who sit on top of a big project have access for a year to an extra five percent of data, but the bottom line is that everybody sees pretty much the same data that is publicly available.
How do we deal with this? We are really hitting the wall. We can deal with one megabyte a second. We can FTP it over and then we can grep through it. We can still do a gigabyte in a minute. We can move a terabyte with Internet speeds in about two days, but when we get to a petabyte, it will take three years to do anything with it. It is on the horizon. And by the way, the petabyte will be 10,000 disks.
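The scaling behind these figures can be sketched in a few lines. The sketch assumes a single sustained throughput of 10 megabytes per second, which is an illustrative value chosen here and not a number quoted in the talk, so it only roughly tracks the round numbers given above. The point it makes is that sequential time grows linearly with size, so a factor of a thousand in data turns days into years.

```python
# Assumed sustained throughput; 10 MB/s is an illustrative figure, not from the talk.
RATE_BYTES_PER_SEC = 10e6

SIZES = {"megabyte": 1e6, "gigabyte": 1e9, "terabyte": 1e12, "petabyte": 1e15}


def seconds_to_scan(size_bytes, rate=RATE_BYTES_PER_SEC):
    """Time to read or transfer size_bytes sequentially at a fixed rate."""
    return size_bytes / rate


def humanize(seconds):
    """Express a duration in the largest convenient unit."""
    for unit, length in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for name, size in SIZES.items():
    print(f"one {name}: {humanize(seconds_to_scan(size))}")
```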
So we do need to do something different. One of the things that is emerging is that we simply must use databases. What are databases good for? If we search for something, we don't have to look at all the data; we can use indices to find something in the data. We can also do searches in parallel, which is what you get with using the databases.
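As a small illustration of why an index avoids looking at all the data, the following sketch compares a full scan with a lookup on a sorted index, using only the Python standard library. The catalog, its brightness values, and the function names are all invented for the example; a real database index is more elaborate, but the principle is the same.

```python
import bisect
import random

rng = random.Random(0)  # fixed seed so the toy catalog is reproducible
# Toy catalog of (brightness, object_id) records in arbitrary order; values are invented.
catalog = [(rng.random(), object_id) for object_id in range(10_000)]


def full_scan(lo, hi):
    """Examine every record and keep those whose brightness lies in [lo, hi]."""
    return [object_id for brightness, object_id in catalog if lo <= brightness <= hi]


# The "index": the same records kept sorted by brightness, built once up front.
index = sorted(catalog)
keys = [brightness for brightness, _ in index]


def indexed_lookup(lo, hi):
    """Binary-search the sorted keys for the range boundaries, then read only the matches."""
    start = bisect.bisect_left(keys, lo)
    stop = bisect.bisect_right(keys, hi)
    return [object_id for _, object_id in index[start:stop]]


matches = indexed_lookup(0.25, 0.26)
print(len(matches), "matches found without scanning all", len(catalog), "records")
```

The full scan touches every record on every query, while the indexed lookup needs only a logarithmic number of comparisons to find the boundaries and then reads just the matching entries.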
Let's get to what is driving all this. We would like to make discoveries. When and where are discoveries made? They are always made at the edges, so this is why people started to go away from the -- when they were traveling around the coastlines of the Mediterranean, and then they started to explore wider. They wanted to explore the unknown.
So this is what astronomers are trying to do; we are trying to go deeper in the universe, and we are also trying to explore more colors, so that maybe instead of three colors, we look at five or seven colors of the same object. We see some funny properties which distinguishes one of the objects from the billions of others.
There is an interesting parallel to Metcalfe's law. He is one of the inventors of the Ethernet. He worked at Xerox PARC, and he postulated that the utility of computer networks grows not as the number of nodes on the network, but as the number of possible connections we can make.
The same thing is true with data sets. If we have N different archives, we can make order of N squared connections between those different data sets, so there are more things to discover. So that is the utility of federating data sets.
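The "order of N squared" count can be made explicit: N archives admit N(N-1)/2 distinct pairs. The short sketch below enumerates the pairs for a handful of archives and checks the count against the formula; the archive names are placeholders chosen for illustration.

```python
from itertools import combinations


def pairwise_links(n):
    """Number of distinct pairs among n archives: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Placeholder archive names, for illustration only.
archives = ["optical", "infrared", "radio", "x-ray"]
pairs = list(combinations(archives, 2))

for a, b in pairs:
    print(f"{a} <-> {b}")

# Enumerating the pairs agrees with the closed-form count.
assert len(pairs) == pairwise_links(len(archives))
```

Doubling the number of archives roughly quadruples the number of pairings available for cross-matching.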
The current sky surveys have really proven this. We have discovered a lot of new objects, where we combine multiple surveys in different colors. This is also pointing to another thing, that there is an increasing re-use of scientific data. So people are using each other's data for purposes which it was not necessarily intended for.
The data publishing in this exponential world is also changing very dramatically. The roles of the users, or the different roles are still present, so the authors, publishers, curators and consumers, but in the traditional linear world, where the number of journals and the books published in a given year is roughly constant or growing slowly, in the exponential world, all these big projects are undertaken by surveys of collaborations of 60 to 100 people who work for five or six years to build an instrument which takes the data, then run it for at least that long, because otherwise it would not be worth investing that much of their time to do it.
Once they have got the data, they keep accumulating it, and they are going to publish it themselves, because every time they discover an error, they have to reprocess all the data from scratch. So the data is constantly changing. So they put it up in a database and make it accessible in their Web sites.
Eventually the project finishes, and the people are going on to other projects. They do different things, so at that point they are ready to hand over the data to somebody else at the end of the project, to some big national archive or centralized storage facility. Then the scientists are still interacting with all this data.
Why are the roles changing? The exponential growth makes a fundamental difference. First of all, these projects last three to five years, make it six years. This means that the typical age of the data that is pushed up to the national archives is typically three years old. This means that in the meantime, the rest of the data in the world has increased by two to the third, by a factor of eight. At any one time, the data on the national data facilities, the centralized facilities, are only going to hold about 12 percent of the data in this world, and everything else is going to sit with the groups of scientists. So this is very different from the linear world.
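The arithmetic here follows directly from the doubling assumption. If the total volume doubles every year and data reaches the central archives about three years after it is taken, the total has grown by two to the third in the meantime, so the archives hold about one eighth of what exists. A minimal sketch, with the one-year doubling time and three-year lag passed in as parameters (the function names are invented for the example):

```python
def growth_factor(lag_years, doubling_time_years=1.0):
    """Factor by which the total data volume grows during the lag."""
    return 2 ** (lag_years / doubling_time_years)


def archived_fraction(lag_years, doubling_time_years=1.0):
    """Share of all existing data that is at least lag_years old."""
    return 1 / growth_factor(lag_years, doubling_time_years)


print(growth_factor(3))
print(f"{archived_fraction(3):.1%}")
```

With a slower doubling time of one and a half years, the same three-year lag gives a factor of four, so the centralized share would be about a quarter rather than an eighth.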
There is also more responsibility of the projects. The people who are astronomers, physicists, biologists, are learning now how to become publishers and curators, because they don't have a choice if they want to make their data public. This is where I am coming from. So I am trained as a particle physicist who turned theoretical cosmologist turned observational cosmologist, because that is where the new breakthroughs came, and now I am worrying about data publishing and curation, because this is necessary to do my science. To do my scientific work, I cannot do it otherwise.
So we are spending a larger and larger fraction of our budget on software, and in many cases re-inventing the wheel. This is on the other hand not necessarily so. So we need basically more standards and more templates. At least if I had a hard time publishing our data, which is currently many terabytes, at least somebody else who is coming a few years later should have an easier time doing it.
There are luckily lots of emerging concepts which help. One is that it is easier and easier to standardize distributed data. Today we heard about XML; there are Web services emerging, supported on many platforms, which make it very easy to interchange data and even to exchange complex data.
There is also a big trend in making computing more distributed. This is called grid computing, where the computing is distributed all across the Internet on multiple sites, and people can borrow CPU time whenever they need it. But again, the people who talk about grid computing, they only think about harvesting the CPUs; they don't think about the hundreds of terabytes or possibly petabytes of data behind it, because we cannot move the data to the computers. SATI At Home can do it, because SATI At Home can send a tiny chunk of data which needs a whole day of computing, so that is affordable. But if you need huge amounts of data where every byte needs a little bit of computing, it is not efficient to move my data to the computer site. It is more efficient to move the computers where the data is, and also move the analysis where the data is.
This is the concept of virtual data. We give enough computing to the sites which host the data, and if somebody wants a new derivative of the data which doesn't even exist at the time in the archive, we just create it on the fly and discard it afterwards.
Essentially, we have now an intercommunication mechanism, and a lot of what is done in grid computing actually does apply also in this work, in this distributed work, which is growing in an exponential way. It is also getting exponentially more complex. The threshold for starting a new project is getting lower and lower as the threshold for the hardware is getting cheaper.
I got into this through the Sloan Digital Sky Survey, which sometimes we also call the Cosmic Genome Project. It is one of the first big astronomy projects which is set up in that mode, to map the sky not just to do one thing, but to try to create the ultimate map, what we can do from the ground.
It has a special telescope. It is two surveys. One is taking images in five colors and the other one is trying to measure distances. There was quite a lot of software involved, and at that time when we started which was in '92, 40 terabytes of raw data looked like an enormous amount. Today it seems not so big a deal.
The imaging survey is taking -- in one night it takes a 24,000 by one million pixel image of the sky in five colors. When we finish in 2005, we will have approximately about two and a half terapixels of data. One can see that we take data in these panoramic stripes; each of the stripes consists of 12 scan lines, and then each of those is -- this image is still twice bigger than the screen resolution of my PC.
We are also trying to measure distances so that through the expansion of the universe, we can figure out what the distance of the galaxy is, and for about one percent of our objects we are trying to get very detailed information. So this just illustrates how complex the data is. Our data flow is such that we take the data on the mountaintop in New Mexico, we ship the tapes via Federal Express to Fermilab in Batavia, because the networks are simply too slow to shove around that much data. Then we process the data and then put it into a SQL database, which currently has about 100 million objects in it, and it will be about 300 million when we finish.
Then we build around this as an experiment. This started out as a hobby between Jim Gray and me, so we did this in our spare time. We were wondering how could you present this complex data to somebody like a high school student. At some point we looked at each other and said, let's not talk about it, let's try to do it.
We built a Web site. Eventually we got a lot of help from various individuals who volunteered time and filled it up with educational content. We opened the Web site up almost two years ago now, and after two years we have about 12 million pages total, and it is growing exponentially, so we are now at about one million hits per month. This is used by high school kids who are learning astronomy, and also learning the process of discovery, using up-to-date hot data, so this is the best data that also any astronomer can get today.
It opened up a whole other can of worms that we are just starting to appreciate. We released the first chunk of data, which was only about 100 gigabytes, in June 2001. What we realized is that once we put our data out, people started writing scientific papers on it. We are putting out the next data release this summer, which will now be close to a terabyte, but we still cannot throw away the old data set, because there are papers based on it. Somebody wants to go back and verify the papers, so whatever we put out once, it is like a publication; we cannot take it back.
This means that once we put out the DR-1 data, which is one terabyte, and then a year later we put out the DR-2, we still also have to keep this terabyte when we put out the next two terabytes, and so on. So this basically goes on.
This publication process is ordered and scragged, so we don't always just throw the last and best version of the data in this work. We have a yearly release and the total volume that we have to keep up is like a different edition of the book, except that the book is doubling every year. So this is what we are faced with.
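The storage consequence of never retiring a release can be worked through in a few lines. This is a minimal sketch that assumes the first release is one terabyte and that each yearly release is exactly twice the size of the previous one; real release sizes will not follow so clean a pattern.

```python
def archive_sizes(first_release_tb, n_releases, growth=2.0):
    """Return (per-release sizes, cumulative storage) in terabytes.

    Every past release is kept, so the archive must hold the running sum.
    The growth factor of 2.0 is an assumption taken from "doubling every year".
    """
    sizes = [first_release_tb * growth ** i for i in range(n_releases)]
    cumulative = []
    total = 0.0
    for size in sizes:
        total += size
        cumulative.append(total)
    return sizes, cumulative


sizes, cumulative = archive_sizes(1.0, 4)
print(sizes)       # size of each individual release
print(cumulative)  # what the archive has to keep online after each release
```

With strict doubling, the archive as a whole is always just under twice the size of the newest release, so keeping every old edition roughly doubles the storage bill rather than multiplying it without bound.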
It also brings up rather interesting things. Once these projects go off the air, then these databases will be the only legacy of the project. Where is most of the technical information today that we deal with? It is going on in electronic communication, in e-mails.
We also have to capture, for example, all the e-mail archives of the project that we have archived. We shouldn't delete them, but actually check them into the database, because this will be the only way that somebody ten years later can figure out what we did with this object or this part of the sky.
This is a Jim Gray slide, who explained it to us when he did a physics colloquium: why is astronomy data special. He said it very profoundly at the physics colloquium, that the astronomy data is very special because it is entirely worthless. He meant this in a complimentary and good sense. He works for Microsoft, so he doesn't have to sign non-disclosure agreements and bring in the Microsoft lawyers if he actually wants to play with a bit of astronomy data.
So there are no privacy concerns. You can take the data and also give it to somebody else. It is great for experimenting with algorithms. It is actually real and well documented, and it has a lot of relevance and a lot of similarities to real-life data. It is high dimensional, it has errors, confidence intervals, it is spatial and it is temporal. So one can do all sorts of exercises that one has to do with real-life data, without being sued.
It is diverse and distributed, so we have now many different instruments that are constantly observing the sky in all the continents, essentially. The questions are interesting, and there is a lot of it. All these objects on the side are the same galaxy observed in different wavelengths. You can see that they do contain information, all these different observations.
This all adds up to the concept of the virtual observatory. We were struggling with this data publication process. It is a small survey. Our friends at Caltech were doing the same thing for the Two Micron All Sky Survey, and a bunch of people at the Space Telescope in Baltimore, across the street from us, and at Goddard Space Flight Center, so at various places we were all doing the same thing.
At some point, we got together and decided that it is much better to try to do it all together, because eventually, as soon as each of us is ready, the astronomers will say, okay, why doesn't it work together, how can I query all of it together.
So our vision is, when we created the concept of the virtual observatory, that we will make it easy by providing also some standard interfaces, how we can federate these databases without having to rebuild everything from scratch, and also provide templates for others, for the next generation of sky surveys to actually build it immediately the right way.
It has taken off. About a year and a half ago, we got an NSF ITR project which is called Building the Framework for the National Virtual Observatory, which has all the major astronomy data resources in the U.S. involved, astronomy data centers, national observatories, supercomputer centers, universities, and also a bunch of people from various disciplines, from statistics, from computer science.
It has taken off rather nicely. We have built already some demos which led to some unexpected discoveries. It is also growing now internationally. Basically this effort is now being copied in over ten countries, including Japan and Germany, Italy, France, U.K., Canada.
Today, all these projects are operating roughly with a funding of about $60 million, and there is really active cooperation. Just last week there was a week-long meeting in Cambridge, England about the standardization efforts, what are the common dictionaries, what are the common data exchange formats, how to do common registry formats, which are OAI compatible, and so on. There is now even a formal collaboration which is called the International Virtual Observatory Alliance.
I would like to summarize here. Publishing this much data requires a new model. We are not sure what this new model is, so we are trying to learn. There are multiple challenges also in the use of the data for different communities. There are certainly publishing challenges. There are data mining challenges. There are visualization challenges, how can we visualize such large, distributed, complex data. There are educational aspects of it. It is also a Web services poster child, that this is how you exchange information with one another.
It is also deep at heart the information at your fingertips, so students have now the ability to see the same data as professional astronomers. There is more data coming, very much more data is coming. So I think by the most pessimistic assumptions, by 2010 we will get petabytes per year.
I think astronomy is largely following particle physics with about a 15-year time delay. Particle physics was also growing through this evolutionary phase. It is probably going to last for the next ten or 15 years, until the next telescope will be so expensive that only the rich can afford to build one, like it happened with Zuring and the Biedersteck accelerators today. But until then, we will live through this data doubling.
The same thing is happening in all sciences. We are driven by Moore’s law: in high-energy physics and genomics, cancer research, medical imaging, oceanography, remote sensing.
I think I would like to conclude that this also shows that there is a new emerging kind of science. When you think about how science started, it basically was very empirical. It started with Leonardo, who did these beautiful drawings of turbulence, and described the phenomena as he saw them. Then through Kepler, through Einstein, people wrote down the equations that captured in the abstract sense the theoretical concepts behind these and provided a simple description, simple analytic understanding of the universe around us.
Then when we dealt with systems consisting of a billion galaxies, we could only do complex computer simulations. Then a computational branch of science emerged over the last 20 or 30 years, and what we are faced now with is data exploration, so we are generating so much data, both in simulations and in real data, that we need both theory and empirical computational tools, and also information management tools to serve us as scientists.
Thank you very much.
DR. LYNCH: Thank you, Alex Szalay. I will turn it over now to David Lipman.
DR. LIPMAN: Cliff, I want to compliment you on your very gentle way of giving the hook, this very soft step he takes. Also, as the last speaker before lunch, I'm going to go easy on you and actually try to finish before Cliff gives me the hook.
Recently I met with Jim Gray from Microsoft Research and Alex Szalay, and they spent a day -- actually, Jim was with us for about a week at NCBI, and we discussed a number of the similarities and differences between biomedical research, biological research and astronomy. It was very stimulating, and I think you are going to see some similar things in my presentation.
Most of you are publishers, and you know that over the last number of years, there is change in publishing. Some of that change is because of the Internet and the Web. There are different ways of getting information out there. Some of it has to do with economics and other things that are going on, and other sorts of movements as far as who owns what. But I think one of the driving forces for most scientists is that science is changing, and since I am a biologist, biological research is changing. When we say that it is becoming more data intensive, it means that a given researcher is generating more data for each paper, but they are also using more data from others for each paper.
So that has an impact on both the factual databases, which I think represent knowledge and data, and the literature, which primarily represents the knowledge. I think that many scientists are feeling, myself among them, that in order to make this work, we are going to have to have deeper links and better integration between the literature and the factual databases in order to improve retrieval. That would be retrieval not just for the literature, but also from the databases, and to improve the actual usability, how much we can extract out of the value in both the databases and the literature. Finally, I am going to try to make the point that the quality of the factual databases is very much affected if one can get a tighter integration between the literature and the databases.
So let's look at some data here. This is not exactly a measure of growth in the number of papers, but it is certainly growth in the number of papers in PubMed and Medline. You can see that increase is substantial, but that it is basically a linear increase.
If we look at the number of factual databases, if you look at for example the number of proteins that are available, that is going up exponentially. If you look at the number of polymorphisms, that is complicated how that is going up; it depends on how one integrates that also. We are now just starting to get a lot of polymorphisms for other organisms. We are going to get many of them from microbes soon, and that is going to be quite a huge amount of data.
If you just look at GenBank, both the increase in the number of sequences and the number of bases, that is going up exponentially. The doubling time is not quite as rapid as what Alex Szalay is talking about, but it is exponential, maybe one and a half years.
A point I would like to make is that we are not only getting more data generated because of certain high throughput centers, but if you just subtract them out and look at your average laboratories, not the centers that are really meant to be generating this huge amount of data, you are seeing that we are getting a lot more data from them as well.
So for example, if you take all the bulk submissions out of GenBank and look at the rate of growth, you see pretty much a similar pattern, although where it sits on the Y axis changes.
Here is an example of a post-genomic type of research that generates a lot of data, proteomics. If you look at the earlier studies and the number of identified proteins, we see that with more recent studies, the amount of data generated with each paper is actually increasing. These are not huge centers. In fact, most universities now have a variety of proteomics cores doing mass spec, generating a lot of data.
If you looked at expression analysis, the kind of work that Pat Brown has been a pioneer on, you see the same kind of pattern. There are many more labs doing this work, but any given lab is actually able to, as the cost goes down, generate more data points per paper. With expression analysis for human, you may be using a chip that is monitoring the expression of a few thousand to maybe 20,000 genes, and you may be going over a range of conditions, so you are easily generating hundreds of thousands of data points that are relevant not only to make your own point in a paper, but also to make that data available so that others can learn new things from that data.
The other point is, not only are scientists generating more data, but they are using the data generated by other people for their own research. So if we look here, this is the number of users per weekday at our Web site, where we are over 330,000 different IP addresses, so that is somewhat of an underestimate, and the number of hits per day. This is growing. They don't just use it and forget about it. They are using it to design experiments.
If you look at a number of the electronic journals that are out there and see the number of links or the number of identifiers from databases people are including in their papers, it is quite large; with supplementary data files you see the same sort of thing.
Let's just think a bit about what is going on, for example, in a typical functional genomics experimental pathway. It could be expression analysis, proteomics, there are others. To do the experiments at all, generally you require a range of genomic data to set up to do the experiments for expression analysis. You need sequence data from a number of transcripts or genomes in order to design your chips. For proteomics, to interpret the mass spec data, you need to be able to compare against a number of genomic data sets.
But another important thing that comes out of it is that you are going to generate often a new kind of data with expression analysis. These are expression levels, what is going on and off in different conditions. With proteomics, it might be various proteins that interact. But you are also going to find data which is relevant to the very databases that you were using to design your experiments. So for example, if you are doing proteomics in a microbe, you may now get mass spec data that confirms the existence, or that certain proteins are actually translated and are being found in the cell. You would like to feed that information back to these databases, so that people now know that some genes which were hypothetical, you now know that in fact, they are translated or expressed and so forth.
Here is a picture. This sort of cycle here that goes on is really critical. A related point here is that if you are generating this data and you just have it as supplementary data files that are sort of lying on somebody's server or your own server, then it is not going to be structured correctly. It is not really going to be as useful as it could be if it is integrated directly into public databases, where the data is structured, normalized and so forth. So we are in a transition period, for example, in expression analysis. Some of the journals are just beginning to require submission to various archives.
So I think it is very important to keep in mind this cycle. This affects I think the way the literature is going to be presented. I'm going to give you some examples of this from work that we are doing at PubMed Central, which is NIH's archive for the biomedical literature.
If one does a query, as a typical egocentric scientist, I use my own name as the query. Just look at the links that might be in a given paper. Here, one link is cited in PMC, so I can see who cited the paper. The link back to PubMed, the reference to articles. An advantage now, because there is a number of links and other things one can do, actually computations if you will on the records in PubMed, you can click from an article and get all of the referenced articles that are in PubMed from it.
There is a link here that says taxonomy. One of the things that we can do, now that we get the full text, we can computationally go through the text and look to see matches to known organism names, and if they appear in certain places in the article, we can be more confident that this was relevant information with respect to that organism.
In the paper that I was looking at, it was comparative analysis to look at gene loss in Saccharomyces with respect to Schizosaccharomyces, and you can see down here the fungi involved, and we were able to infer gene loss in Saccharomyces by finding closer matches between Schizosaccharomyces with organisms outside of the fungi.
So an important point here is, we are not just pulling off terms, but we are pulling off the terms using certain algorithms and then matching them back into a taxonomic tree, so that we can do other kinds of queries.
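The idea of scanning full text for known organism names, and trusting a match more when it appears in certain places, can be illustrated with a small sketch. The section names, the weights and the `score_organisms` function are all hypothetical; the talk does not say which algorithms or weightings are actually used.

```python
import re

# Assumed weights: a mention in the title counts for more than one in the body.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def score_organisms(article, known_names):
    """Score each known organism name by where and how often it appears.

    `article` maps a section name to its text. Matching is case-insensitive
    and on whole words, so a name is not matched inside a longer word.
    """
    scores = {}
    for section, text in article.items():
        weight = SECTION_WEIGHTS.get(section, 1.0)
        for name in known_names:
            pattern = r"\b" + re.escape(name) + r"\b"
            hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if hits:
                scores[name] = scores.get(name, 0.0) + weight * hits
    return scores


example_article = {
    "title": "Gene loss in Saccharomyces cerevisiae",
    "abstract": "We compare Saccharomyces cerevisiae with Schizosaccharomyces pombe.",
    "body": "Methods and results are described here.",
}
example_scores = score_organisms(
    example_article,
    ["Saccharomyces cerevisiae", "Schizosaccharomyces pombe"],
)
print(example_scores)
```

Each scored name would then be attached to its node in the taxonomic tree, which is what makes the tree-based queries described next possible.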
So for example here, if we go to our taxonomy database and go at the level of proteobacteria -- remember, we have a tree here -- we can look at all the data elements that we have from various databases that are either at that level of the tree or below, or only mentioned at that level.
So for example from PubMed Central, we can see that we have almost 30,000 papers that we could link up and say that they are about the proteobacteria, but if we are interested in papers where they are talking about specifically at the level of proteobacteria, not one of the species below that, then there are far fewer papers.
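The difference between the two queries, everything at or below a node versus only what is attached to the node itself, is easy to see in a toy tree. The `Taxonomy` class, the paper identifiers and the choice of species below are hypothetical and are not meant to reflect how the NCBI taxonomy database is implemented.

```python
class Taxonomy:
    """A toy taxonomic tree with papers attached to individual nodes."""

    def __init__(self):
        self.children = {}  # taxon name -> list of child taxon names
        self.papers = {}    # taxon name -> set of paper ids tagged at that node

    def add_taxon(self, name, parent=None):
        self.children.setdefault(name, [])
        self.papers.setdefault(name, set())
        if parent is not None:
            self.children.setdefault(parent, []).append(name)
            self.papers.setdefault(parent, set())

    def tag_paper(self, paper_id, taxon):
        self.papers[taxon].add(paper_id)

    def papers_at(self, taxon):
        """Papers tagged exactly at this level of the tree."""
        return set(self.papers[taxon])

    def papers_at_or_below(self, taxon):
        """Papers tagged at this level or at any descendant."""
        found = set()
        stack = [taxon]
        while stack:
            node = stack.pop()
            found |= self.papers[node]
            stack.extend(self.children[node])
        return found


tree = Taxonomy()
tree.add_taxon("Proteobacteria")
tree.add_taxon("Escherichia coli", parent="Proteobacteria")
tree.add_taxon("Helicobacter pylori", parent="Proteobacteria")
tree.tag_paper("paper-1", "Proteobacteria")
tree.tag_paper("paper-2", "Escherichia coli")
tree.tag_paper("paper-3", "Helicobacter pylori")

print(len(tree.papers_at_or_below("Proteobacteria")), len(tree.papers_at("Proteobacteria")))
```

Because the query follows links in the tree rather than matching a text term, a paper that only ever names a species is still found from the higher-level node.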
So when we link over to PubMed Central from that, it doesn't do it as a text term; it does it as a link over from a particular database record from where that is in the taxonomy. Again, you can see that now we can do more exploration. In fact, I could click to see the nucleotides. Now you can get information on a sequence that was referenced in a given paper that looks almost like a figure in the paper, except for it is live and you can extract information computationally, if you want.
These are just papers that were cited.
One more example here. If I do a query on methylase in isoprenylation, I got a variety of papers from PubMed Central. Here is a paper that links to a whole variety of databases. If we look at the paper itself, which was in Molecular Biology of the Cell, we can see there is this link to domains, and those are conserved protein families.
The point of this paper was that they analyzed particular genes in Saccharomyces, and they aligned them with the related proteins from other organisms. They were looking to see the conserved regions.
This is a figure from the paper. It is not very easy to read that, but by looking at the conserved regions they would then go ahead and mutate those and see if they could knock out function. It is a common thing that people do.
However, because we could link this to known conserved domains, instead of being dependent, if I am a reader, on trying to puzzle out what was in a static figure, we can link to our domains database, where we have this protein family aligned, and you could actually look at the figure and see exactly where the conserved residues they were talking about were in this. But now we can control what range of organisms we are going to have in this alignment. If there were new proteins that were in the family, you can see them. If there was a known protein structure that was in this family, you could see this superimposed with that protein structure. So by having this fairly fine level integration between -- a deep link between the literature and factual databases, you value the papers higher, because not only can I understand what was in the figure, the point they were trying to make, but I can go beyond their point and look at newer information that was available.
I'm probably going to mess up here, but there is a project we are involved with trying to curate protein families from the point of view of structure information, conserved sequence, phylogenetic information, distribution in the tree of life and so forth. It is a very powerful approach.
I'm going to give an example. People are very interested in this. There are lots of papers written about the evolution of protein families from the point of view of structure and function, because those three things are really well connected. It helps you understand how these things work, how they got that way, and also how to intervene with drugs and so forth.
Here is a publication on one family, the lipocalin family, where it is widely distributed in nature, many organisms quite evolutionarily distant from each other, but also there are different subgroups in this family that actually do different things.
Typically you see in these papers these multiple alignments and pictures of structures that are superimposed. There is another one which is a subset of these families, and so they are trying to explore them in different ways. You can get nicer figures because it is electronic than you could. Then here is the distribution in various subsets, the evolutionary trees representing these families.
Unfortunately, this is not very easy to get through for somebody reading the paper. If one were able to link to this family in the conserved domain database, or actually project the sequences you were talking about into that, then we have a tool, a helper application, that links to the Web that many people use.
This is the same family. You can look at this much more easily this way and sense where things are. In fact, I have the sequences here as well. If I take this particular conserved block of residues, I can see where they are here. It is much easier. If I want to break down the family to look at each protein and see how close that -- we can now see structurally how conserved, in terms of how much they move, those residues are.
Many tens of thousands of people have downloaded this program. This is not something that is a sophisticated program, but it runs on lots of computers, so that is not -- I'm nervous downloading it on Powerpoint, but on the Web it works fine.
What I am not able to demo here, but we will be including, is the tools to simultaneously look at the phylogenetic relationships of these proteins and to look at the distribution.
Anyway, without going into any detail here, the very distant group that we saw where we just had that little core, that were proteins that are actually distributed among all animals. If we were to look at a different subgroup here, this subgroup which is actually a fatty acid binding protein, you can see much more of it is conserved. I could get that up and spin it around, but I'm not going to do it.
Thank you very much.
DR. LYNCH: Thank you, David Lipman. It is terrible to keep time for these people, you understand, because all of this stuff is not only fascinating, but seems to invite deeper probing. You want to get your hands on it yourself and explore in many cases. But we wanted to try and keep this more or less on schedule, to allow primarily time for you and for your questions.
While you are gathering for those, let me just ask our three panelists if any of them have any immediate reaction to the other presentations they heard that they want to take a moment to share, or any follow-on point they want to make there.
You're not obligated to do that. The last session they didn't do it, either.
DR. LIPMAN: I will make a point. I think from most of what I have heard, most of these sessions people haven't been poking each other too hard, so this is a weak poke.
I think one of the problems that I heard with the STKE is because it is proprietary, because you need to have a subscription, people don't link into it as much as they would. I think that one of the issues with the factual database is, if we are going to integrate across many factual databases, using various organizing principles in astronomy, space-time coordinates, in biology we don't have quite as powerful or as general organizing principles right now, but we have some. I think it is a limitation when you have some of these resources that not everybody is going to be able to get into. But it would be interesting for me to hear from Monica Bradford on how they deal with that issue.
MS. BRADFORD: Well, actually the connections database is free. Anyone can use it. The other parts of the site are behind a subscription wall, the parts that are more your typical journal type items. So one of the ideas is to see if you can combine some ways to support the whole combined environment while keeping the database itself free.
All the authorities, that is what they want. The value is from it being freely available, and it has always been freely available.
DR. LIPMAN: Is it linked in? Every day, how many refers in do you get from various other resources?
MS. BRADFORD: I don't know the number. I quoted some number about how many of the pathways were viewed each month. I don't know how many link into it. 10,000 connections, not pages, are viewed each month. It has been a slow process to grow it, so we are only at a point now where we are getting to the point, we have about 50 pathways, so we haven't gotten to the point of scaling up. It is a little early to say how much activity you are going to see, but I think it is going to grow a lot.
Just one other thing that we have thought about. We do believe that the data will have value for drug discovery. So one of the thoughts along the way is, is there a way you can license it for use by pharmaceuticals so they can integrate it with their own databases behind their firewall, with the idea that that would help support the overall site and allow it to remain free for academic use.
So far, most of our experts, authorities, are comfortable with that model, but I would be interested if anybody has any thoughts about that.
DR. LYNCH: The microphones are open. I don't know if we have any questions by the way from our colleagues on the Web, but if there are questions out there, please forward them in by e-mail. Dan?
DR. ATKINS: These three excellent presentations were very representative of the 80 or so exciting testimonies that we had as part of these cyberinfrastructure investigations. I just want to use them as the occasion to say a few words to help make this a little more vivid, a little more imperative, and to try to encourage this community to get behind NSF really doing something bold in this arena.
First, this exponential data world that we are moving into is certainly present in astronomy, it is present in many other fields. It illustrates that the challenge and opportunity includes going to higher and higher performance networks, higher and higher speed computers, higher and more capacity storage, but to do that also together with another dimension that I mentioned this morning, and that is functional completeness, having the complete range of services. So the challenge is moving up into this two-dimensional space that involves this balance between increased capacity and increased function.
A second point that these talks illustrate is the real exciting potential for multi use, the fact that the same underlying infrastructure is serving the leading edge of science, it is also serving making learning of science more vivid, more authentic, more exciting, all at the same time. It is multi portals situated to the right audiences that you can build around these exciting environments.
Also, the astronomy example illustrates, although a major central or distributed investment is needed to create this infrastructure, once it is created, leading edge teams or individual amateurs, provided they are given open access to these data and to the tools, can make seminal and important contributions to science.
So in all of what you have heard today, there is potentially, particularly if we can get more into this open access realm, a democratization, an increase in the diversity and the inclusiveness, so it is part of the bumper sticker from the cyberinfrastructure that says it is affecting what is done, how it is done and who participates.
Finally, it illustrates in our judgment the urgency for getting on with some leadership in this area. These communities are off doing this; they are scraping together the resources. We have cosmologists that are becoming data curators and on and on. People are putting extraordinary efforts into this, and that is very commendable.
On the other side, if we don't get the right investments, and we don't get the right synergy between domain scientists and librarians and information specialists, we could end up with sub-optimal solutions, solutions that don't scale, and worst of all, we can end up with Balkanized environments that don't interoperate and pay enormous opportunity costs.
So this is a commercial not for the Atkins report. It is the commercial for trying to give, from whatever advantage and point of influence you have, the NSF and the NIH and the Department of Energy, to come together in maybe a rare sense of cooperation and try to create the underpinnings of this kind of environment that is going to allow this kind of thing to happen and prosper on a grand scale.
I also hope that you have inferred more explicitly from what has been said here the enormous need for the role of the information specialists, librarians, all the various labels that are represented in this room today.
DR. LYNCH: Would any of you like to comment on that?
MS. BRADFORD: I agree. I think one other aspect that didn't surprise us, but has been a nice benefit with STKE is, AAAS does have an NSF grant related to what is called BiosciEdNet, which is trying to take online resources and make them available, or make it easy for instructors to find them to use them in their course work. We have been able to extend the use of STKE by adopting the same metadata that is being developed for that program.
We have now extended our audience from what was primarily the researcher, that was our initial focus, to provide the tools and information they needed, but we are finding that it is also having an incredible use in education in undergraduate courses. So we are working to develop pieces of it now that are tagged with the metadata of that project so that they will be searchable through that portal.
DR. RESNICK: I want to ask about the quality control or review process for the data that gets into these databases. Can anyone put their data in?
DR. SZALAY: Typically in astronomy, what is emerging is that there is a threshold. The cost of putting your data online is that you provide adequate metadata in certain formats. So this keeps out a lot of people whose data is not of high enough quality, or who haven't documented the data acquisition process well enough.
But beyond that, we thought a lot about this, and we didn't want to institute a formal refereeing process. We will probably introduce an annotation system eventually, where people can feed in comments and annotations. But right now I think it is just this threshold of documenting well enough.
DR. LIPMAN: In the databases that we have at NCBI and I think typically for the life sciences of the molecular type data, you have databases like Genbank, where authors directly submit sequence data, which in a sense goes into an archive. So when that process started up 20 years ago, direct author submission, the sense was that people were going to put in all kinds of make-believe data and so forth. Actually, that doesn't happen.
But the other thing that does happen is, you have data which is redundant, some data which is redundant, and some versions are more high quality and others are low quality.
So that has given rise to related databases, which are curated, some by expert groups on the outside, some by the databases themselves. So for example we curate the human sequences along with some others, mouse sequences, and we have data sets, a reference sequence set, where we work with curation groups and independent databases to put together a comprehensive set of curated sequences.
So in general for the life sciences, for the molecular data, you will have two related sets, the archived set that represents what scientists provided at the time, and the curated set, which represents to the best of our knowledge what is true now. So those two work together well, because sometimes what some experts think is correct now turns out to be incorrect and something that was already there may be more correct. So you have the pointers to the older versions.
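The two related sets described here can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions: the class names, field names, accession strings, and sequences below are hypothetical and do not reflect the actual GenBank or reference sequence schemas. The archive keeps every submitted version unchanged, and a curated record carries pointers back to the archived versions it was built from.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ArchiveRecord:
    """What a scientist submitted at the time; never edited afterwards."""
    accession: str
    version: int
    sequence: str


@dataclass
class Archive:
    """Append-only store keyed by (accession, version). Hypothetical design."""
    records: dict = field(default_factory=dict)

    def submit(self, accession: str, sequence: str) -> ArchiveRecord:
        # A resubmission creates a new version; older versions are kept.
        latest = max((v for (a, v) in self.records if a == accession), default=0)
        record = ArchiveRecord(accession, latest + 1, sequence)
        self.records[(accession, record.version)] = record
        return record


@dataclass
class CuratedRecord:
    """Best current understanding, with pointers to the archived sources."""
    curated_id: str
    sequence: str
    sources: list  # list of (accession, version) keys into the archive


archive = Archive()
first = archive.submit("X0001", "ATGCCA")   # original submission
second = archive.submit("X0001", "ATGCCT")  # later correction

curated = CuratedRecord(
    curated_id="REF_1",
    sequence=second.sequence,
    sources=[(first.accession, first.version), (second.accession, second.version)],
)

# If the curated view later turns out to be wrong, the older version is
# still reachable through the pointer.
older = archive.records[curated.sources[0]]
```

Keeping the archive append-only is the design choice that makes the pointers to older versions meaningful: a curator can change the curated record without losing what was originally deposited.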
MS. BRADFORD: We are more of a curated database. We have a point in time where the pathway and all the related components are externally reviewed by people outside of the group of authorities that created it and did the data entry.
But that is a snapshot in time. Those pathways and data entries can be constantly updated. So right now, the only ways that someone can comment during the period between the formal review is either by directly sending an e-mail to the authority, which you can do right from the graphical interface, or do the feedback function, in which case everyone sees it.
We do think that some additional tools to improve the review process and also to allow for more community annotation over time, as long as it is clear what is community annotation and which is the official part that has gone through the review process. We think a combination of the two things will add the most value over time.
DR. LIPMAN: One other point that relates to the question, which is somewhat of a difference between astronomy and biology. Astronomy has an organizing principle, in that there is a coordinate system, space-time coordinates, which is largely agreed upon. Although there are a number of different coordinate systems that are used, they can be resolved largely to one.
So given that there is a stronger theoretical base in astronomy because of physics and one has a natural organizing principle which is quite profound, there is the ability with a variety of computations to actually assess some aspects of how good the data is.
In biology at the level of the genome and transcripts and proteins and to some extent protein structure, there are organizing principles that are natural and strong enough that you can detect things that don't make sense. The whole thing fits together in a certain way, and as you get more transcript information or more comparative data for protein and so forth, you can see how things fit together. But above the level of the genotype and to some extent protein structure, with expression data, pathway kinds of things, proteomics and so forth, we don't really have natural organizing principles that allow us to do that.
One of the difficulties of projects like STKE and any of the other projects that are functional genomics is that it is much more difficult -- I'm not saying they don't exist, but it is much more difficult to use cross validation in various ways to convince yourself of the quality of the data. So a challenge, as we have high throughput methods in biology that are at the level of function, will be to deal with quality.
MS. BRADFORD: I would say too that one of the reasons we took the pathway we did is because all the tools aren't there. People are trying to make connections across different disciplines. They are coming into something, maybe they are a cancer specialist and they find an oncogene and it turns out it is in a signalling pathway. They may be looking at information that they themselves have not -- literature they haven't been following or data they haven't been following.
So we think the value of having science and university libraries and our editorial processes associated with it is that it does help build some trust in what they are seeing when you don't have the more automated tools that David is talking about. Then you can at least link to what is established in the archival and curated databases elsewhere, but at a certain point when you are pushing the edges and trying to gain new knowledge, you have to figure out what you are going to trust. That is what we hope we add.
DR. SZALAY: Can I add a quick thought also to what David Lipman mentioned? It brings up another interesting issue. If one has -- so as the data is growing exponentially, over a period of time it grows by a factor of 100, and our computers are also growing by a factor of 100, if we have immediately an organizing principle that we can put every bit of data in place we find, we can keep up.
If we have such a problem that we have to connect every bit of data to every other bit of data which is N squared, so we have 100 times more data, our computers are 100 times faster, but our computational problem is 10,000 times larger. We are in a spot. And we are starting to approach this.
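The arithmetic in this remark can be written out as a short sketch. It assumes, purely for illustration, that the work for a task scales as a simple power of the number of data items N; the function name and its parameters are hypothetical.

```python
def relative_runtime(data_growth: float, compute_growth: float, exponent: int) -> float:
    """Runtime relative to today for a task whose work scales as N**exponent.

    data_growth:    factor by which the amount of data grows
    compute_growth: factor by which computer speed grows
    """
    return data_growth ** exponent / compute_growth


# Placing each item once using an organizing principle: work grows like N.
linear = relative_runtime(100, 100, 1)

# Connecting every item to every other item: work grows like N**2.
pairwise = relative_runtime(100, 100, 2)

# Growth in raw work for the pairwise case, before faster computers help.
work_growth = 100 ** 2

print(linear, pairwise, work_growth)
```

Under these assumptions the linear task takes the same time as before, while the pairwise task involves 10,000 times more work and, even on computers 100 times faster, takes 100 times longer.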
DR. LYNCH: Marty Blume, I think you were next.
DR. BLUME: I'm struck by the connection between a number of things that have been discussed here and things that have happened in the past. It is useful to look there to see how we got here, and perhaps to extrapolate with two points into the future.
First of all, if I go back to the discussions of a few days ago, the e-print archive grew out of the pre-print archives in print back in the 1960s that followed the Xerox revolution at that time. Looking back particularly from astronomy and looking into particle physics, there were experiments done in the 1960s with bubble chambers, and there were many photographs taken of events. Each time a beam came in, they would take photographs of it.
This data itself represented the equivalent of the sky survey, because it could be used not just for one discovery, but for meta experiments that were done on that. In fact, I see the same thing in the astronomical data, where you can re-use this for many different experiments or analyses that yield new data into the future.
But it did come from the past, and I can imagine meta experiments being done on all of the data here that is collected by other people who do real experiments, that is, in a laboratory. But then the analysis can yield many different and new things.
That data was available freely after awhile back in the 1960s, and I gather that you keep the astronomy data for close analysis by those who have actually taken it, and then make it available later on to others who can do with it new ideas and new experiments.
I see this as a very strong future. Of course, the electronic means this changed dramatically, but it still looked like a massive job in the 1960s. I remember one paper where the omega minus was discovered, and it had, horror of horrors, 33 names in it; 33 people had collaborated in this. It was a sign of things to come. We can extrapolate from that into the future, too.
DR. ROSENSTEIN: Linda Rosenstein, University of Pennsylvania. We have licensed electronic journals from probably every STM publisher in this room. We also have scientists who are working on some cutting-edge stuff.
In fact, from our genome institute, we were asked a question, and I would like to read the question and get the answer. This relates to what Dr. Lipman was talking about.
How might we automatically download, create, and/or centralize a repository of identified articles? Our intent of use is to extract data and republish subsequently extracted facts. As the extracted data would in no way resemble the original text and would also have pointers to their sources, we don't believe this is a significant issue.
Now, these guys want to do something that is quite new in the way of science, and they think it is very important. They are in cancer research. We know we have all these restrictions on the data and the use of the data that we have licensed for our university researchers. So how do they get over the hump of getting into this incredible data mining process which presumably will have great results in science and in medicine perhaps, but we are still living with a different kind of model of how the scientific literature is currently available to us.
DR. LIPMAN: That is a good question. I wasn't here to hear Pat Brown's presentation, but obviously the related point to my presentation and to Alex Szalay's is that there are issues in terms of, is the current model of fee for access fitting how the literature and the factual databases want to be used.
You may be referring to research which is text mining work to try to look at relationships among genes or whatever, based on looking at papers. I'm not a big believer in that work, though it may have promise. I got a slide from one group who does that because I was thinking of showing that.
I think that you don't even need to imagine that that is going to be useful, but you can look at more the kinds of examples I was giving, where you actually have organized data sets associated with the literature that you would use the two together.
I think it is an issue. I think it is one of the forces that publishing is going to have to contend with. I don't know where we are going to be five years from now or how we are going to get there, but I think that there is tension between the current model and how scientists want to work.
But I don't know the answer as to where it is going to go. PubMed Central is something where publishers volunteer to participate. They are not necessarily giving the newest versions of their information into it, but we have had requests from scientists who want subsets or entire sets of PubMed Central data to compute on locally, to make some of these kinds of discoveries. Some of the publishers that cooperate with PubMed, that participate in PubMed Central, upon request allow us to provide for download this information.
As we start to get more quote open access journals where they are explicitly following some policies of open access, we will be able to provide that stuff for download automatically, like we do the sequence data. But I think that the more literature is open access, the more new directions we are going to see, in terms of how the literature and the data sets are used and how they are developed in the first place.
We interacted with some radiologists who were interested in creating a special database of radiological images that would be useful for education and training associated with the journal. Clearly, if the journal is open access, then it becomes a very natural thing to put the two together. But I think that this is a challenge that publishers are going to have to face, and the community.
MS. BRADFORD: I would just say that I think the tension is good. I think it has helped us to think about what we are doing and the basic goals we have.
As I stated in the beginning of the presentation with STKE, we really want to help researchers be more efficient. We are supposed to be advancing science and serving society, and I think STKE has the potential to help do that.
So the tension makes us think again and rethink and try things and experiment. It is an evolution. I think with the tension and having both ends of the spectrum working at the same time, eventually we are going to find a middle ground that will survive. I think that will be good, and it won't totally be dependent on government control or government funds. Hopefully we can come up with some creative ways to do this that will work in the marketplace.
DR. COLLINS: Thank you for three very exciting presentations. Your fields are not mine, and it is wonderful to see what you are doing and what you are trying to do in the future.
My question is somewhat related to the prior one. Presumably the methods for organizing and labeling these huge data sets reflect current knowledge in the field. Does the availability of all this wonderful data to so many people enable researchers when they need to, to take quantum leaps in their knowledge of phenomena? Or is there a danger or a risk that because things are organized according to what we understand now, it might be tempting just to stay in that vineyard and not make big advances or face big changes in modes of thinking?
Do you have any thoughts on whether that is an issue, how it might be addressed, particularly for disciplines that aren't as far along as you are in putting together these huge data sets? Thank you.
DR. SZALAY: Maybe I will respond first. I can speak for astronomers. In the Sloan Sky Survey, which is now about 40 percent ready, after two nights of operation we found six out of ten of the most distant quasars in the universe. Now after about two years in operation, we have ten out of ten. So this is a small telescope in astronomy standards, so this is a two and a half meter compared to the ten meters. The fact that we are so efficient in finding new things by organizing the data and scanning through the outliers I think already shows the point.
So I think it is very clear that this is efficient and good. Whether this is the only way to actually store the data, the answer is clearly not. With these large data sets, I think the only way to make sure the data is safe is geoplexing it. So storing it somewhere else in the United States or somewhere in the world, if we store it somewhere else, we may as well store it differently, in a different organization, so that it gives a different type of view of the data, as long as we can recover.
So I think that is the answer. If we organize our mirrors slightly differently, each enabling certain type of things, and then redirect the queries to the most appropriate place, might be a nice answer.
DR. LIPMAN: So for the databases like Genbank and most of the factual databases in the life sciences, there are multiple sites that have those data. For Genbank there is a center in Japan, there is a center in the U.K. and a center in the U.S. We exchange data every night, but we have different retrieval systems and different ways of organizing access into that. Furthermore, people can download large subsets or the entire -- and do whatever they want and make commercial products. So there is many ways to get into that data.
So I think that with the literature, if we started to have more and more of that open access, and there were multiple sites that provided comprehensive access, I think you would see different models there. We are working with INIST in France to be able to set up another archive for PubMed Central, and they are going to be organizing that data in a different retrieval system, and we have some discussions with other groups as well.
So I think that when you have open access to the data, then you can have multiple archives that are comprehensive, and you can also have different ways to get at that information.
I think that you are actually referring to a deeper issue. If we have organizing principles like ontologies -- in biology a lot of people talk about the gene ontology as a way to make it easier for people to see what is there, and to move around and to understand the information, there is a tradeoff between that as a representation between what we understand now and the kind of new discoveries that Alex Szalay refers to, which change that.
So I think there is a tension between reductionism and computing from the bottom up and learning new things, and being able to say this is what we know right now, and having that superimposed on top.
Right now, I would say that there is a huge amount of interest in ontologies in biology, and I think some of it is misplaced. One of the reasons why people went the molecular route was that we really didn't understand enough from the top down. If you look at proteins or genes that are involved in cancer, you find those that are part of glycolytic pathways and so forth.
So I think it is not clear to me at some level how much these ontologies assist you in finding things and understanding things and how much they obscure new discoveries. But as long as we provide access to the data in a very open way, so people can download and do almost whatever they want with it, then if they want to ignore something like an ontology, they can do that.
So I think it was a very deep question you are asking, particularly for where we are at now in the life sciences.
MS. BRADFORD: I would agree with what David said. I think he has hit most of the points, but I would just like to make a few observations that are not quite as global, but based on the experience with STKE.
The research in signalling started out looking at things very linearly. You followed it down one pathway, and soon you realized these were really networks, that you couldn't just think about one pathway; you had to think about all these pathways together and how they intersected and how they affected each other. You couldn't do that in print. So the ability to begin to database and to be able to represent these and then query across the pathways.
So our hope is that we will be able to develop bioinformatic tools that will actually let you look for inter-path connections and find new knowledge. Eventually we would also hope that perhaps you could even download and play with it and add in your own information to see if it changed the effect or had an effect on the pathways.
I'm not a scientist, I'm probably not explaining this quite as well, but I do think that these tools -- the information is just so vast, you need tools to help you organize it. Once that is taken care of, it gives you more of a chance to begin to think about new things and to look at it differently. So now we have gone from the networks, and now we have to think about systems and signal transduction. I think that you will find the same thing in a lot of other disciplines.
DR. LYNCH: Just as a followup on that question, another striking example of that is the migration away from surrogates to full texts that you can compute on, and that is having a radical effect in many, many fields. Being able to find things by doing computation and searching on the full text of scholarship, as opposed to locked into some kind of a classification structure for subject description or whatever, that some particular group used in creating surrogates.
That is a theme I hear from scholars in every field from history all the way through biology. It is really quite a striking example of how things are changing.
In the back.
DR. BANKS: Marcus Banks, National Library of Medicine. I have a question that hopefully connects to both panels from today. If we are moving from publication as product to publication as process, should we make a similar transformation in archiving? Or should we still want to archive products that maybe are extracted from that process?
An example in STKE would be the Viewpoints from Science, that are snapshots at the time. I am of two minds about that. It seems that archiving the best in a separate subset would be useful, but then that might be paper bound and the old way we have done things. So I was just curious about -- I don't even know how we would archive the process, so just some thoughts about that.
MS. BRADFORD: We have actually been talking about that, how we could -- because it is constantly changing and the database is updated, and can be on a regular basis, we assume as it scales and grows that will increase a lot. So do you archive on a certain schedule, and get the -- I guess you would call it a snapshot of the database at that point in time, so say every quarter.
I don't know yet what would be the right amount of time to do that. The Viewpoints don't have to be paper bound. It is just that right now we are in a transition stage, and these authorities still want to get credit and want to be recognized for the effort. It is a lot of effort to do these if you are going to do it well. So the purpose for the Viewpoint is not so much archival, although it does serve that purpose. It is more to give the authorities some recognition.
DR. ATKINS: Since that was intended for both panels, Rick Luce actually hit on that a little bit when I was trying to push him off the stage. Let's use the term digital objects. There is an enormous stream of digital objects that could be created by knowledge discovery processes that are mediated through technology, let me just make it that abstract. So you need to be able to archive not only these objects, but the relationships between the objects and the social context in which they are created.
Now, the most far out -- and I think the answer is yes to your question, we need to start thinking about archiving that, including all these temporal streams and so forth that come through. One of the furthest out, most profound ideas about all of this comes from John Seely Brown, who says that it may turn out that the most important aspect of this technology mediated way of work is not just relaxing constraints of distance and time and enhancing access and so forth, but actually comes from what you are saying, the possibility of archiving the process, not just sampling the artifacts along the way.
In areas of ubiquitous computing -- and in everything I am saying, there are huge issues of appropriate use and all of that that I'm not going to get into, so this is a really far out thing. But the idea that people could then come back after the fact and actually mine processes and extract new knowledge that otherwise has been left on the table. So it is an extension of this whole notion of data mining into knowledge process mining, so it gets very abstract. But I think we can start to see that it is not just fanciful, and it is something to think about.
But I think people who are interested in long term preservation need to consider huge repositories that take into account not only the basic ingredients, but the social processes by which these ingredients are encountered.
DR. KING: One of the things that has happened over a period of 15 years with science is that scientists have increased the amount of time that they devote to their work, about 200 hours a year. The point being that scientists are reaching an individual capacity as to how much they can devote to their work and this kind of thing. Most of that 200 hours, by the way, is devoted to communicating.
We find that the number of scientists increases about three percent a year, which means that it doubles about every 15 to 20 years or so. We had data this morning that says some of the information we are gathering doubles every year.
It seems to me that one of the real concerns is the limitation of the capacity of the human intellect to work with these data. Maybe I don't understand the data sets that we are talking about, but it seems to me that the Academy and NSF and others have got to begin to focus on trying to increase that number of scientists who work with these data and the infrastructure that can help work with these data, to begin doubling every ten years or maybe every five years instead of every 15 to 20 years.
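The growth rates and doubling times mentioned in this comment can be related with a short calculation. This is a sketch that assumes steady compound growth, which the speaker does not state; the function names are hypothetical.

```python
import math


def doubling_time(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_for_doubling(years: float) -> float:
    """Constant annual growth rate that doubles a quantity in the given years."""
    return 2 ** (1 / years) - 1


# Three percent a year, the figure quoted for the number of scientists.
years_at_three_percent = doubling_time(0.03)

# Data that doubles every year corresponds to 100 percent annual growth.
years_at_hundred_percent = doubling_time(1.0)

# Annual growth rates implied by the doubling periods mentioned.
rates = {years: growth_for_doubling(years) for years in (5, 10, 15, 20)}

print(round(years_at_three_percent, 1))
for years, rate in rates.items():
    print(years, f"{rate:.1%}")
```

Under this assumption, exactly three percent a year doubles in a little over 23 years, somewhat longer than the 15 to 20 years quoted, which would correspond to roughly 3.5 to 4.7 percent a year. Doubling every ten or every five years would need about 7 percent or about 15 percent a year.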
DR. LYNCH: Comments? I see head nodding.
DR. LIPMAN: I think that nobody can disagree with the basic point that you are making, but scientists do adapt to dealing with large data sets. If you have the compute power and the data is there, you ask different kinds of questions. It does take a long time before more scientists within the community shift and start to think of different kinds of questions.
If you are used to generating small samples and whatever else, as opposed to getting a comprehensive -- all the genes of a genome and whatever, then I think that it takes awhile to get used to the kinds of questions you are going to ask. But people do. You will have a few pioneers that start to think a new way, and then it starts to happen.
So actually, I have a lot of confidence that the more data we generate, if the computers are there and the access to the data is there, people will come up with ways to ask the questions.
DR. KING: I think what I am asking is, are there lost opportunities. It seems as though there must be, but I just don't know.
DR. LIPMAN: The lost opportunity I think exists is, NIH a few years back set up this initiative called BISTI. I don't remember what the darn acronym was for, but it was to try to get more people involved in computational biology. They came up with a bunch of recommendations, most of which I did not necessarily think were that important.
But I think that really what I thought the goal of BISTI would be was to say we need to train more scientists and give more money for computational research for the kind of work that Alex referred to, where you are analyzing other peoples' data sets, because there is a huge number of discoveries to be made.
So I think there is to some extent a lost opportunity, in that the number of biologists who are doing this kind of work -- and I'm not talking about people doing what a lot of people think of as bioinformatics, which is making databases and XML things and systems for other people to use, but actually doing the research yourself, and publishing papers which get published in biological journals. There are not enough people doing that. I thought the main goal for BISTI should have been training more -- although that was a goal, training people to do that kind of work, and more grants for doing that sort of research.
So I think to some extent, we are a little behind, in terms of having enough scientists doing that kind of work. So you're right.
DR. SZALAY: I think also this data avalanche or data revolution is going -- if the data gets properly published is also going to cause another fundamental sociological change. Today in science, a lot of people train through a lifetime to build up their own suite of tools that they apply to the raw data that they can get to.
But here, through these digital resources, one can get reasonably polished data, where one can think more about the discovery itself, and doesn't have to spend so much time with the mechanics of scrubbing the raw data and converting it into usable form. So people will be much less reluctant to cross boundaries, for example, if the data is all available in a ratified and documented form. I think this is going to have serious sociological implications.
DR. LYNCH: I think this is going to have to be our last question. You're up.
DR. DOYLE: One thing I am always struck at when I see these presentations is, it makes what we do when we publish simple papers look trivial. The amount of data that we do, the amount of text that we do. What really drives home the point is that -- this morning I mentioned the two to three orders of magnitude difference between pure dissemination and what a publisher might do in creating an archival XML and doing peer review, but then there is another two or three orders of magnitude up to what researchers actually are doing with their time.
That makes me hopeful that publishing is really becoming more and more of a minor cost, compared to the cost of doing research and these kinds of things. I would hope that to the extent that regular publishing would piggyback on these larger kinds of things, it is an important thing to be done, but it is not nearly as costly as doing the research or accumulating these much more complex kinds of things.
Anyway, so the question is whether you see the relationship, or if there is a transition to where these things become much more primary than the journal articles that come out of them now.
MS. BRADFORD: I was going to say that the number we just heard, 200 hours going towards communicating, there is a huge cost right there. That is the real publishing, the communicating. That is the idea, to get the idea shared.
So the amount of time someone puts into creating whatever product, be it a connections map or whatever, is significant. It is a time away from doing your basic research. So we are happy to hear from the authorities that at least it helps them organize it. It gives them an added value, in that they have to organize their own understanding and put a framework on their area. But that is a significant cost you can't forget about.
DR. KING: I didn't really mean that. I mean the cost for doing the traditional paper. We talked about $1500, is that realistic for a researcher to be willing to pay that, or some granting agency, rather than going through a subscription model. What you see is, the scale of the costs that are involved in these things is much greater than that.
MS. BRADFORD: I can't imagine an authority would pay to get to do this. It would have to be a totally different model.
DR. KING: No, I'm talking about the model just for the traditional publishing of papers, not for creating these richer things.
DR. LIPMAN: Implicit in your question though is, if you look at the NIH budget and you look at the number of publications, what is the relationship between the total amount that NIH spent and the number of publications that came out of it. One issue would be, is that going up. I'd have to say that probably it is. It is probably right now about $250,000 or $300,000 a paper, if you think of it that way. So the amount for doing the publication in some journal is a smaller part of that.
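The comparison being drawn here is simple arithmetic, sketched below. It takes the $1500 publication charge mentioned earlier in the discussion and the rough $250,000 to $300,000 of research spending per paper as given, and the function name is hypothetical.

```python
def publication_share(publication_fee: float, research_cost_per_paper: float) -> float:
    """Fraction of the research cost per paper taken up by the publication fee."""
    return publication_fee / research_cost_per_paper


fee = 1500  # dollars, the per-paper charge mentioned in the discussion

share_low = publication_share(fee, 300_000)   # against the higher cost estimate
share_high = publication_share(fee, 250_000)  # against the lower cost estimate

print(f"{share_low:.1%} to {share_high:.1%}")
```

On these rough figures the publication charge comes to about half a percent of the research spending behind each paper.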
So there are issues in terms of economic analyses of open access publishing versus fee for service on the basis of that, but I think that even though you probably weren't asking the question, the thing Monica was referring to, which is how much time you have to think, and this other gentleman was asking, all those other things are factoring in as well to this.
In a sense, the paper does represent what you thought about what you did; it is the knowledge part of it. You can only do that so fast. So I think there is a difference between organizing data sets and making them usable to other people, which is a challenge, and finding and extracting what you think about that data set and getting it out there.
DR. LYNCH: I think we can squeeze in one more if it is really quick. You have been waiting patiently.
DR. SIMONI: I knew if I stood there with a sad look on my face you would submit. Actually, I have some prepared comments that I'd like to make.
Actually, I have a quick point and a quick question. My name is Bob Simoni, I am from Stanford University, and I work for the Journal of Biological Chemistry. With regard to the factual databases, I think it has not been said, but needs to be, that part of the credit for their enormous success and our reliance upon them actually needs to go to the collaborative effort the journals have made. The journals now all require as a prerequisite to publication that the information be deposited.
Now, you might think that is a natural thing, but it is not so natural. One particular area, protein structure, the data that crystallographers used to gather used to be held and not deposited, and it has frankly been from pressure from peers to a certain extent, but more importantly from journals that that data be deposited as a condition of publication, that the databases are as robust as they are.
So a little credit, I think. I think the overall worth of those cannot be underestimated, or overestimated, or one of those two.
The question I have is, David Lipman, you mentioned gene expression information. I think there is a crying need for a centralized, normalized system for deposition of gene array data, or array data and gene expression data. I understand that you have been working on that, and I'd just like to know what the status is.
DR. LIPMAN: Particularly relevant to JBC, we have been having discussions with David Klein on that, who is active with the JBC on that issue.
There are standards that people have tried to agree upon for what is the minimum necessary to be submitted to a database. Unfortunately, that minimum is so high that a number of scientists are not willing to do that. GEO, which is the database that we have at NCBI, has a somewhat lower threshold in terms of the requirements for submission, but we are in discussions with the international group on that issue.
GEO is growing a lot faster now. Some journals are requiring submission, and I think Bob Simoni's point about the critical role that journals have in terms of getting data in a useful form into these databases is critical. Despite the fact that databases are useful, scientists don't want to spend their time with that. What they are judged by is not what they put into a database, but what they published.
So I think the role of the journals is absolutely critical. I refer to what I would say again is critical, and JBC was one of the real pioneer journals in pushing submission to the sequence databases and getting essentially 100 percent compliance.
So anyway, we are in discussions with David Klein and others on that, and we are already starting to get submissions from scientists on the basis of publication. I think New England Journal is requiring that.
MS. BRADFORD: Can I just make one point of clarification? Something that David said made me worried that I may have not been clear.
STKE is not at this point a sophisticated bioinformatics tool. We know that. What we are hoping for is in the next round of funding to be able to develop those tools that would allow you to look for inter-path connections, and use some of those things that we think will then make it more likely that these large data sets can lead to new discoveries.
So if I wasn't clear about that before, I apologize.
DR. LYNCH: It is time for lunch or for a break, for those of you who are out there on the Web. We will reconvene here at 1:55 promptly for the closing session. Please join me in thanking the panel.
(The meeting recessed for lunch at 1:00 p.m., to reconvene at 1:55 p.m.)
AFTERNOON SESSION (1:55 p.m.)
MODERATOR: We'll finish up in the next hour and a half maximum, I think. I had a couple of quick announcements so that you knew our plans with regards to the content of this meeting.
First, not all of the discussion, but all of the presentations plus the PowerPoint slides that have been used in this symposium will be up on the National Academies Web site shortly, a couple of days to get it up, I guess. So you will be able to hear presentations you may have missed, or encourage others who were unable to be here to listen to the presentations in the future. So they will be available for streaming audio.
The members of the steering committee will meet immediately after this session in the boardroom, in the far corner. That is just a meeting of the steering committee to assess things.
With that, we are grateful to Mary Waltham for putting together a fine panel that will help us put together everything that we have been seeing here for the last two days. Mary?
Agenda Item: Panel 6: Wrap-Up Session
MS. WALTHAM: This is a bit like the complete works of Shakespeare in two and a half hours. It is parallel to that. It also struck me that what I am going to talk about very briefly before I introduce my distinguished and wise panel is really just a set of summaries of some of the things that we have been discussing.
Although much of my corporate publishing experience was at Nature and before that The Lancet, in my role as an independent consultant now, I work with publishers, both for profit and not for profit, some that are quite small and publish just a few journals, and those that publish very many. So I have an opportunity to helicopter around over the area. I thought it might be useful for the purpose of this discussion just to present some of the things that we have been talking about over the last 36 hours. As I say, I have just put this together in the last 36 hours, so I hope it does the trick.
What I am going to talk about is publishing systems. That is what I feel competent to comment on. My panel are going to comment from their own views within their organizations.
What have we talked about? We have talked about costs, printed and online versus online only, and some discussion about, wouldn't it be nice if print went away. The only reason that print won't go away is because we need good permanent access to an archive, although there are some of us who believe that print will survive for some of our users.
In terms of archiving, who does it, what do they archive, and who curates it, who looks after it and how much of it. We have seen that there are economies of scale, quite vividly in some areas that we talked about yesterday. The end result of that is a general feeling that we are in a pool where there are some very big fish roaming around, and big fish tend to eat smaller fish.
We have talked about disaggregation of publishing, the creation part, the dissemination part, all the other pieces of it that are just breaking up, as various speakers have spoken about. A broad range of business models, and this looks likely to diversify and hybridize, good biological things are going further forward, and I think we will watch that emerge.
Filtering and quality control. We have talked a lot about stuff and putting stuff up on the Web, lots and lots of things. But I think from the publishing point of view, there is a need to filter that and to control the quality. A big question is, who will do it and will anybody pay for it in the end.
Increased online access equals increased usage. That is a common feature for most publishers. Customers and users want more granularity. Even the journal as a package of information is not granular enough. They may want smaller and smaller levels of access.
Copyright, who needs it? I use the term ‘needs it’ very, very carefully there, rather than wants it or uses it. I think that is very dependent on both the author and the mission of the publisher who they may be working with.
Online licensing activity costs money on both sides, libraries and publishers. There is a continuing publishing process coming through, where there is not a static document, but it continues to evolve, and lots of ramifications there.
Interoperability, common standards being essential for us to bring together all of this information, and some key roles there going forward.
Collaboration has been a big word here in the last day and a half, authors collaborating, users collaborating, publishers collaborating, libraries collaborating, clearly a big theme.
The use of the term publication is evolving. We heard this morning about pathways and data sets and some dramatic examples of that. Is a pathway or data set a publication? If so, what are the issues to do with credit and ownership?
Integration. The literature and the media are all getting closer together, becoming linked and integrated and indeed, this is what users want. I suppose a key point here for publishers is, where is the value added, and is that value added where your users and your customers want it. I think that is a fundamental shift and change in this area. And how will the role of publishers continue to change to meet the needs of the research community.
So I apologize if that was a really quick run-through. It was simply to remind you of some of the things that we have talked about in the last 36 hours.
I am going to introduce all three of my speakers at once and tell you a little bit about them. The first speaker is Mac Beasley, who is a member of the National Academy of Sciences. He is professor of applied physics in the Advanced Materials Lab at Stanford. When I first met Mac, he was dean of humanities and sciences at Stanford.
Jim O'Donnell, sitting next to him, is provost at Georgetown University and professor of classics. Anne Wolpert at the end is Director of Libraries at MIT.
So with no further ado, I'd like to start off by handing the stage over to Mac.
DR. BEASLEY: Well, I do feel a little bit about bringing coal to Newcastle here, in the sense that I think my qualifications for being here are that I know Mary and I over wine have pontificated on issues surrounding electronic publishing and the like. I think this is her way of getting even.
I think more seriously, it has to do more with what she is asking of me, to come as an interested practicing scientist who is old enough to have been around the block a few times, so I take that point of view in my remarks.
I think also I have to say, and as you appreciate, I'm sure, that I come from a particular background and perspective. What I have to say comes surely from the perspective of an interested scientist, somebody working in the lab once again, with a physical science background. So I'm sure that some of what I say will be colored by that, but I'm not trying to speak for everybody, but rather try to say some things that may be interesting.
So what I would like to do is share some perspectives that I have gained, sitting here thinking about this and doing some homework over the last couple of weeks, and then sitting and listening for these two days, that I hope will be interesting, possibly useful, and almost certainly idiosyncratic.
I can't resist first of all saying on the print versus online, certainly in the community I live in -- you all have got some stories like this, so this is nothing unique, but I'll share mine with you. I did a little tour around the top floor of our building where my office is. There are maybe 60 or 65 offices there. I looked in to see who had print journals and who didn't. There were only two people that had print journals. Not to malign them, but they were both over 70 years old.
Now, we have four faculty on that floor, emeritus faculty, who are over 70, and two of them didn't have journals. So in the community I live in, they voted clearly and definitively, I believe, and it is not going to take but a very short time for it to be universal. But I don't know how general that is. I just couldn't resist telling the story.
The first more substantive thing I have to say, and again, I don't think it will surprise you, but I think it is nonetheless very important to say it. I certainly have the feeling, impression, or however you want to put it, that some sort of broad open access to scientific information bases of various kinds, journals, let's start with, is inevitable. It is just too important not to have them. That is a view of somebody who is a practicing scientist. I assure you that although they are younger and naive, they certainly -- that is the view the students are taking these days.
I fully appreciate even more, perhaps, maybe not as well as you all do, but certainly given what I have heard the last couple of days, it will not be simple to get there, and there are a wide variety of vested interests that have to be accounted for, and real issues, legal and everything else, that have to be accounted for. But it just strikes me as so important, that it must happen, and I think ultimately scientists will insist on it.
Nonetheless, I agree with Monica Bradford from Science magazine this morning, who said that this will be tension filled, and many of the points that I share, and I think it is a very important one, namely, that tension is not necessarily a bad thing if it is creative. However, we all know that tension can be destructive, and it is going to take some wise heads and good leadership to make sure that the tensions which are inherent in this are ultimately creative for science and all that partake in that enterprise. But we'll see.
My second impression, and again, this will hardly be a surprise to any of you, I just hope to put a little more novel twist on it perhaps, is that publishing, in quotes, is obviously not going to go away, I just can't imagine that, because it does provide value added. I think the oft-noted here in the last couple of days, the not seeming, but actual breakdown of the present system we have is clearly forcing an examination of what those fundamental values added are in the current publishing modes. Equally interesting is what alternatives there are in terms of publishing broadly defined to achieve those, certification and formatting and editing and very important things of that sort.
Most interesting to me because I am trying to think ahead a little bit, what new modes of value added are we going to create. There are some wonderful ideas out there. Half of them, maybe three-quarters of them, won't work, but that is okay. It is the quarter that do survive and do provide the essential value added that will be wonderful to have and very interesting to see evolve.
Also, to state the obvious, it is clear that access to information almost universally will continue to profoundly improve in speed and efficiency, and therefore in the efficiency of science. I think we all know that, and I think that is basically good. But the but for me is dealing with all that information, not only getting to it, but getting to it with some sense of the quality of it, the various other types of value added that might be seen. Hyper speed in itself is not the ultimate good. It is speeding up the process of achieving -- and I'll use the words as you understand them -- understanding as distinct from knowledge or simple facts.
We do ultimately have to achieve understanding in science, it may mean different things in different fields and so forth, but I think I did not hear in the discussions here a whole lot of -- well, I heard ideas that might implement that, but I didn't hear a focus on the things that have to be done in addition to what we have heard about in the last couple of days, to really bring that aspect of the scientific enterprise.
I believe that aspect of the scientific enterprise should have been brought in sharper focus, so I raise it here. I think people are thinking about it; I'm hardly the only one, I'm sure, but I think it is important to stress that we do need to do that, and I personally find it very, very interesting.
Another way of saying it is, we must find methods to improve and to accelerate our ability to form good judgments, in addition to understanding on the basis of the information that we will be getting which will be more, presumably better, getting it faster and so on and so forth.
Let me give you a concrete example. I think it is an important one. I haven't thought about it exhaustively, but I have observed over my now, God forbid, 30 to 40 years as a university professor that graduate students can't read the literature. It is too hard. It is too abstract, or it has got too much jargon, or it has got concepts they don't understand. Maybe this is something more specific to physical sciences, I don't know, but just simple access to it is not going to be enough.
So now I am wearing my university teacher hat, and I think there are some issues there as well. We are going to get all this information, but how are we going to help these kids deal with it. They may be more facile in using the tools, but they aren't going to be facile in this essential value added that I am trying to comment on.
So I think some ideas that are relevant to the questions I am raising were discussed here, and I think people are looking towards them. Let me give you a list that is only meant to perhaps define what I am trying to say better.
There are personalized searches. Well, that's fine. There are virtual journals. That is a good idea. But now we come to what I think is a little bit more what I am talking about. There are in Science and Nature perspectives on some of the articles. I am a fan of those, I'm an avid reader of those. I think that is an example of the kind of thing that I am talking about, because it helps me gain perspective, not only in my own field, but a little bit more broadly, and I am old enough that I want to do that. Or maybe it should have been done when I was younger, or whatever.
But anyway, there is an idea floating around in the condensed matter physics community, which is what I am a member of, of electronic journal clubs. There was a famous journal club. This is where somebody reports on a paper that is selected because it is excellent and important, and then it is presented and critiqued. There was a tradition of this at Bell Labs which was initiated by Conyers Herring, a truly great physicist, and some of the alumni of Bell Labs are now trying to institutionalize that by putting it online somehow. I don't know whether it will work or not, but I think it is an excellent thing to try.
I learned about knowledge environments today, and I learned about connection maps. I think these are again ideas that appeal to me as examples of what I am trying to say.
So again, as a university professor wearing my teaching hat, I find this aspect of the implications of electronic publishing and the digital revolution more generally very, very interesting and not ones that have received the attention either in this community as far as I can tell, and I would have to confess, not seriously in the scientific community and the universities, either.
Let me just close by saying, I do think that as I listened to this and I heard all these ideas of how to structure knowledge and so on and so forth, I had an interesting thought. I'll just throw it out there and let it float, that what is going on here in some sense is an experiment in epistemology. It would be very interesting to get some people who really understand epistemology to sit in and listen to all of this and say, what do we make of this. I'd like to know the answer.
MS. WALTHAM: So now over to Jim O'Donnell. Jim?
DR. O'DONNELL: Good afternoon. I'm from Central Administration, and I'm here to help you. True to my provostial calling, I shall speak as a two-faced administrator. True to my professorial calling, I shall tell you which face is speaking each time.
For the first of the personae that I bring with me is that of what I am sobered to realize this week must be one of the most senior open access electronic online journal publishers anywhere in captivity. We began publishing Bryn Mawr Classical Review (BMCR), an online book review journal for classical studies, in the fall of 1990. It has always been free, it has always been on the Net. We tried selling a paper version for a few years, that didn't go anywhere. We did an archival CD in the mid-90s and sold it for seven dollars, and made a good 30 bucks on the whole deal gross revenues. We are still in business. We are certainly the leading book review journal in the classics in the United States, and one of the top two or three in the world.
We have all the worries and anxieties of an open access publisher, including where the next dollar is coming from. But we are confident we are in business to stay.
In that capacity, I am well aware that the discussions, particularly yesterday, about the future of journal publishing and its models have two crucial elements. The first element of a new system is, it will drive costs out of the old systems. There is a general sense that we pay a lot for information and if we can possibly pay less, we would like to do so. I have been in this business long enough to have shared the dreams and imaginations of the early 1990s, when we thought that electronic publication would be somehow magically frictionless and cost free. We know it is not, but at the same time we know there are opportunities to drive costs out of the system. The presentations yesterday morning were particularly instructive in that regard.
It is a separate question to talk about recovering those costs and sharing those costs in an equitable and sustainable manner. The open access movement, such as it is a movement, it wasn't a movement when we got started in BMCR, tends to shift costs away from the user and towards the producer of information.
I practice that, and I recognize the value of it, and I have a qualm about it. It is the qualm that I want to share with you now. The qualm is that the adventures in the non-open access domain of journal publishing in the last 15, 20 years have demonstrated that among the features of high-priced paper or electronic journals is a form of quality control. When university libraries cut their budgets for serials acquisition, reduce the number of titles they acquire, they are exercising a form of quality control over what they will accept. If the big deal is fraying around the edges, that will transmit itself back to the publishers as news about what journals are to be sustained, what journals are not.
The actual closing of real journals, cutting back of titles by publishers over the last decade, has to my knowledge been achieved mainly through the Pacman gobbling up of one publisher by another, and the recognition that not every title was sustainable in the economic model of the for-profit publisher.
That is a qualm. But what I want to emphasize is a disagreement around that qualm yesterday. Voices on this platform spoke up for the open access model as delivering superior quality. I believe you can argue that the commercial model provides superior quality.
My insistence on this, and the point that I will leave, is that as you argue about business models, as you argue about costs and recovery of costs, the added dimension seems to me to remain an open question, the dimension of the quality on the product and the effect on the quality of the system that is under review. As long as that issue remains open, there is no forcing argument to use in favor of one model or the other.
A mere economic argument would not be such a forcing argument. Simply because one system is cheaper than the other does not prove that it is better, if indeed the quality of the science, the quality of the communication achieved were markedly inferior to what was achievable in the other system.
It seems to me that coming out of this conference, for me the recommendation would be to continue to assess what the effects on quality of information, timeliness of access, quality of peer review in the different models would be, and I'm not sure frankly how that discussion will play out.
If as an open access journal publisher however I am part of the solution, as a provost I am undoubtedly part of the problem for everyone. Commercial publishers will complain that I don't give enough money to the library, and that is why they think the prices of the journals are too much. The open access enthusiasts will complain that I don't underwrite their experiments in new forms of publication, because I have this silly complaint that I am still paying for the old system. I'd like to stop paying for that before I pay for the new system, please, and not be caught paying for two different systems at once.
Sitting here with that face on, I am struck by the emerging differentiation of the product that you, the scholars, the scientists and the publishers, want me to pay for. I was particularly struck yesterday morning by many of the numbers of costs and particularly Mike Keller's numbers, about the half life of information, the precipitous drop in commercial value of information over the first 12 months of its life.
That seems to me to suggest at least a tri-partition of the kind of information objects that are under discussion here. At one end, there is the timeliest of information services, providing the linked news from the front as rapidly as possible, with no consideration for archiving, no considerations for price, but simply getting the good news as fast as you possibly can so that science can proceed in its best way.
At the other end of the life expectancy of that information, the service has become an artifact, something to be preserved, something to be maintained, something to be sustained long after its commercial life, long after perhaps even its scientific life has been exhausted. Preserving the science that is done today for the historians of science 100 years from now is an important exercise, but the historians of science 100 years from now don't have their budgets yet and can't come to the table with dollars to pay for that task.
That first information service tends to be market based. That last artifact service is not market based at all; it is something done out of noblesse oblige for the greater good of the community.
Between there is a borderland area, where the information service itself needs to be mediated to those who have limited access to the market. If you have large research grants or you are in a major research university, you are probably doing okay regarding the information that you can acquire, but if you are in a developing nation, if you are in a small institution, if you are otherwise disadvantaged, you may not be doing okay right now, but I think we all in this room agree that there is some component of an effective scientific, technical and medical information system that will address those inequities in the market and find places at the information table for those who do not have the market clout behind them to elbow their way onto the table and pay whatever it takes to get whatever they need.
Understanding that differentiation of product which I believe is an increasing differentiation in an electronic environment, increasing among products and increasingly differentiated among disciplines will be an important part of understanding what a new system of information dissemination can be like.
If I am as provost part of the problem, what problem am I a part of? That is my question to myself. I offer you the advice of the best training for new provosts, is to fish out an old shareware version of Space Invaders or Missile Command, and play that game as the bogeys keep coming at you in greater quantity, faster and faster all the time.
If you want my attention for your problem, you have to understand that I haven't much time for anxiety. I haven't much time for something that might happen two or three screens from now. I am worried about what is happening to me right now. If you bring me imminent fear, that is a good way to get my attention, and if you bring me a total flaming disaster, that is an outstanding way to get my attention. I get first the emotional satisfaction of chewing you out for letting it go this far, and second, the satisfaction of digging in to do something.
Between, there is a concern not for the anxiety two screens from now, but for the substantive expectation two screens from now. I do accept that it is my responsibility to be at the table when we talk strategically about what is going to happen years from now.
My caution to this group is that a fair amount of what I have heard in this room, both from the platform and from the questions, has been not imminent fear, not reasonable expectation, but still shapeless anxiety. I have to tell you unfortunately, I cannot pay as provost to fend off every piece of unsubstantiated anxiety coming towards us. Every other department, every other office of the university comes to me with the same kind of anxiety.
I am struck by how much progress we have made in ten years. I am struck by how the discourse has changed. I am actually optimistic in that regard. My question for my librarians, for my scientists, for my colleagues will be, can you make sure you tell me what the problem is? Can you make sure you make me understand that it is a problem, and can you make sure I understand what success would look like.
I think my take-away from this conference in that regard would be to say that it is time to begin disaggregating the scholarly and scientific publishing crisis, if there is one, into pieces that can be addressed in rational and coherent kinds of ways, as part of the many movements that have begun to emerge in the last ten years, as e-journals have become a reality, as open access has become an astonishing reality for me, and as the impact on academic culture has been so great. If you can give me a sense of the problem, give me a sense of what the solution would really look like, then you are going to find me to be a much better provost than you would otherwise.
Thank you.
MS. WALTHAM: Thank you, Jim O’Donnell. Now on to our last speaker for this symposium, Anne Wolpert.
MS. WOLPERT: This is great. When your name starts with W, you are usually at the back of the bus and the back of the classroom, and here I get to have the last word. I also get to increase Jim O’Donnell's anxiety, so this is a two-fer.
The views that you have heard expressed, we have all heard expressed the last couple of days have been largely those of the authors and publishers in the scholarly communications system, so I would like to take a moment to comment on what this world looks like, listening to what everyone had to say, from the standpoint of the part of the university that is responsible for bringing in, paying for, managing and maintaining access to the kinds of information resources we have been talking about for the last day and a half.
I want to remind us of something Hal Abelson said this morning, which was basically, to paraphrase, that the progress of science requires access to raw material and evaluated judgments and conclusions, to work that is current and previous. This is true in the university, not just in one's own discipline, which was the thread that ran through this, but in multiple disciplines, because universities are dealing with students who have not yet settled into a discipline or who are working in the interstices between disciplines. So we are paid in university libraries to think across those kinds of boundaries.
One of the things that I heard in this meeting has been that the raw material and the evaluated conclusions coming out of scholarly research are bifurcating into two fairly distinct realms. One of them is a quite tightly controlled, often highly priced peer review literature, and the other is the minimally controlled, scholar managed, open access publication and database regime.
It seems to me from where I sit that both environments present challenges to the universities that are by mission dedicated to providing homes for researchers, educating the students who come to our environments, and managing the information that is created today and yesterday for the benefit of scholars and students who haven't yet been born.
I think one could generalize from what we have heard the last couple of days that the controlled literature is making every effort to increase that control over the content that they publish, and to expand their reach both in time and in format.
We heard that there is considerable interest on the part of publishers of controlling the archive of those publications, so that they would be responsible rather than university libraries. The disciplines themselves or the publishers themselves would be responsible over time for the archiving of publications. We heard this morning the notion that one might also want to add data to that record, which would bring all of these resources under the intellectual property control of the publishers themselves.
I think this is a perfectly rational business model, and I say that from the standpoint of someone who manages the MIT Press and struggles on a day-to-day basis with the finances of publishing, but it is not a perfectly rational business model if you happen to be on the university end of this formula and are looking at buying these materials from publishers under the kinds of intellectual property controls that are created largely for the benefit of the entertainment industry rather than scholarly publishing.
The university perspective on the value chain of the process by which new information is created is very different from the value chain perspective that you hear either from authors or from publishers. If you think about the value chain from the university perspective, the university builds the infrastructure. I was almost going to say they hire faculty, but that would be a mis-statement. We know from yesterday, no one hires faculty; they provide an opportunity for faculty to develop curricula and conduct research.
The universities attract and admit the students, especially the graduate students that conduct some of the research in our environments. They provide access to information at a cost that seems to them to be reasonable, as a percentage of the overall annual operating budget of the university. And they delegate that responsibility to their university libraries, which is where research libraries come in.
I will say, not everyone in research libraries believes that the subscription model as a way of acquiring information is fundamentally broken. What is believed is that the cost of that model is broken from the end user point of view, and that the terms and conditions under which information is allowed into our campuses are substantially broken, because licensing agreements control in many cases what subsets of a community can actually look at information. So it is the combination of the price and the constraints on the information that represent problems in the system for us.
So anyway, if you are looking at this value chain for universities, and you get to the point where you have to evaluate faculty, and you have to see to it that the work that goes on in your community gets published, one way to think about that from the university perspective is that it was outsourced.
So you come down the value chain to the point where you need to evaluate faculty and you need to get their work published. You say, I don't want to do that for a whole variety of perfectly good reasons, so I'm going to outsource it to publishers. So the publishers would handle the peer reviewing, they would handle the publication of the work, and it would come back into the university for the next generation of education and research at a price that was reasonable as a percentage of the annual operating budget of the university.
What has happened over the last ten or 15 years is that that outsourcing enterprise has in fact developed its own independent business model, that doesn't think of itself as having a symbiotic relationship with the university as much as it thinks of itself as a stand-alone, independent, able to create and craft its own business future, separate and distinct from what the university wants. It is in that environment that the costs have gone up, that the terms and conditions of use have changed, because there is now a very clear boundary between universities and the groups that capture the work that comes out of universities, publish it and evaluate the quality of that work.
One of the consequences of course of this separate and distinct business that is now at more than arm's-length from universities is that those businesses out there that are doing the publishing and the peer reviewing have come to think of universities either as patrons, which is to say they have a relationship where they owe the publisher a certain amount of money to support the work, or they think of them as pigeons, which is that they can be plucked endlessly to support a profit margin that they may or may not wish to support if they were asked. Or there is the alternative, which is that universities are bandits, and they are stealing this free information, trying to get for free information which was high value and added to by the publishers.
So it doesn't make for a particularly useful set of expectations around which to have conversations, although Lord knows we keep trying to do that.
The conundrum going forward for universities has several dimensions. First, for many publications, the costs are simply too high for the value received, and the licensing conditions are problematic, in terms of what we can do, particularly with digital information when it comes into our campuses.
As you heard from yesterday's panel, the intellectual property environment is not only incomprehensible to the average faculty member and student, but it places what happens at universities at high risk, because any legal regime that operates on the basis of the fact that if you need an answer about whether you can do something or not, and the answer comes back, it depends, is not a particularly useful legal regime for people to operate under. So people are afraid to litigate, they don't know what the outcome is going to be, and we heard a lot about this in the afternoon session yesterday. Now of course, there are new forms of data and information that clamor for attention for curation and funding on our campuses.
Some observations to conclude, that come out of having listened to this program from the perspective of the university librarian. It is apparent to me anyway that scholarly communication and scholarly publication are diverging. Once upon a time, you communicated to your colleagues through your publications; now it is quite clear that you can communicate outside of the formal publication record, and that the formal publication record is moving in many cases off to one side, which again affects the question of how much value do we put in the formal record of advances in a discipline, if in fact most of the communication is happening some other way. That is a little bit of what DSpace is about at MIT; it is to capture the communication, not the formal publication record.
I think we hardly have a clue about what reasonable standards and norms might be for the cost of peer reviewed publication. Those costs are all over the place. We don't know what drives them. We don't know what drives the cost of print as opposed to the electronics. So it is very hard for us to think logically without the kinds of norms and standards that you can get from most other industries.
It is pretty clear that intellectual property law that meets the needs of the entertainment industry and the international publishing conglomerates isn't particularly conducive to what goes on in the academy. We don't know what new models of peer review and recognition might be developed for open source publication, and that is an area of real attention for us all.
Lastly, I took away from this session a wonderful new business model for university libraries, which is what I call the offshore library model. I will go home and advise my colleagues in the state of Maine who are dying under the current cost structure for peer reviewed literature and have had to cancel right and left, that all they need to do is go to one of the 130 countries where this information is delivered for free, open a branch library, and they can have an offshore library. So thank you.
MS. WALTHAM: Thank you very much for three quite different reflections on what we have heard over the last 36 hours.
I'd like to start off by asking the panelists if they have got any questions they would like to ask each other, or having listened to each other, are they all in agreement before we throw the floor open.
DR. O'DONNELL: While you are that far away with the hook, could I say a word or two more by way of afterthought?
MS. WALTHAM: Yes.
DR. O'DONNELL: My third faith is a personal faith. I do want to add that I was struck by the increased vividness of the presentations today. I think Anne Wolpert's comments just now are quite relevant to the change in the divergence of ways, or the change in the combination of ways between communication and publication, and the building of these collaborative enterprises seems to be an exciting prospect, and I even suspect as a provost, I might be interested in solving that kind of problem.
That said, I thought we did address fewer of the problems there that might emerge as, for example, those huge databases age. I grant that in another ten years, five or ten petabytes, what the heck, you'll be carrying it around in one of these, but it will still take some housekeeping and maintenance and something that begins to look a lot like either publishing or libraries or both to do.
But I think my question then would be, what is it going to take to do the good science and to make sure the good science gets done. That should probably be the bottom-line question and set of priorities.
MS. WALTHAM: Thanks, Jim O’Donnell. Any other immediate comments from the panel?
MS. WOLPERT: I actually would ask Malcolm as a scientist in a fine university what role he sees for the university, if any, in the curation and management of data that emerges from science? I'm thinking now of Dan Atkins and the cyberinfrastructure report and the question about where responsibility resides for the long term storage and curation of data.
DR. BEASLEY: Gee, I don't know. More seriously, I think Mike Keller does have some thoughts about that. I don't know if he is still here or not. I think that is in fact a more appropriate question for him.
But I would like to comment, I do think that -- and with the provost here, it will be interesting to hear how he reacts to this, too. One thing I would say is that the costs of providing these -- call them library services in this modern sense, let me confess as a faculty member, forget that I was a dean and maybe learned some things there, but faculty in general don't have a clue about the cost of these things, and therefore are not necessarily good informed people to make judgments or pontificate or anything else about how these things ought to be paid for and what the real tradeoffs are.
I don't know how the provost deals with that. It is almost inhuman. That is not a flip answer to your question.
MS. WOLPERT: I asked the question in part because I am interested in knowing where the discussion -- how and where the discussion can go forward on university campuses.
DR. BEASLEY: I do have a feeling, and now perhaps I am reflecting my decanal experience, that the faculty do need to understand these things better, because they are just going to go and beat up on the provost. To some degree, it simply isn't fair, and to some degree -- if you do it when you are just ill informed, it must be really exasperating.
I think there are a number of areas in which -- any area of the provost will get beaten up, I'm sure, but the point is that in this case in my judgment, it is such a revolution in the way we are doing science that it is too important to leave to the provost alone, or something like that.
DR. O'DONNELL: I'm grateful for every scrap of pity that I can get. I think in that case, I would say that the real question is not what I want to do in the abstract, but where does it get done best, and I don't think we know the answer to that. I could understand an argument that would say only the scientists know what good quality curation looks like, and what maintains it in good enough form to be usable, therefore it must be done in house. I could accept an argument that the scientists haven't got a clue how to do it technically or managerially or with great metadata or with appropriate access or appropriate preservation, and therefore you had better outsource it to somebody who knows how to do that stuff.
The outsource question is then, do you outsource it inside the not-for-profit community, inside the university, inside the larger not-for-profit community, or do you finally outsource to the market. I just don't know where the answer would be.
MS. WALTHAM: I'd like to open the questioning up to anybody in the audience now, and to anybody who is listening from the Web cast. Any questions for the panel here, please? If you would like to identify yourself, please do.
DR. KING: I'd like to comment on Anne Wolpert's comment just now. If you present data to the faculty, do it in such a way that it is not just the price per title, but rather the price per article. There are some journals that the cost per article has actually gone down because the size of the journal has gone up much more rapidly than the price has.
The other thing is that you should also present it in a way where you give them the cost on the price by use. What has happened is that the number of subscriptions by scientists has dropped from 5.8 down to 2.2 over the last 25 years. As a consequence, all of that reading has now gone into the library. So academic libraries have roughly twice as much reading by the scientists, and in specialty libraries it has gone up four or five times.
So you need to give them the right indicators, if you want to use that term, but look at it in a way that is realistic for the faculty to understand.
MS. WALTHAM: Thank you, Don. I'm looking for questions. Marty Blume?
DR. BLUME: Is a comment a question? I'd like to say on behalf of publishers from learned societies, if somewhat ungrammatically, to the professors and librarians and provosts and scientists here, that we are you, and you are us. Our society is run by professors, by laboratory scientists, the president, the vice president, past president, president-elect. It is operating off of our scientists.
We have been professors. We are still, some of us at night, under the covers with a flashlight where we can't be caught, still researchers, at least the theorists among us are. So the values that we have are the same values that are expressed by you. We understand some of the problems of the publications perhaps a little better, because we have been thrown into it, and we are therefore forced to think about it. But on the other hand, we need your understanding that this is where we are coming from.
Our publications committee is chaired by an industrial scientist. The members of it are university professors and laboratory scientists as well. Paul Ginsparg is a member of that committee, so we have representation from the electronic archives. All of this comes together at this point, so we can't be accused of having a very narrow point of view. We are just better educated.
PARTICIPANT: I'll try to phrase these things as questions, if I can. First, for Jim O'Donnell, would you say that you really implicitly gave us the answer to the dilemma you posed, that if it is possible to overcome the arrogance of the scientists and the humility of the archivists, that the best solution, the natural solution, will come from these two groups of people learning to talk to each other, to deal with the problems of these enormous databases, that each brings something, but not a complete thing, to understanding how to do it?
The other is for Anne Wolpert. Would you accept that the role of publication still remains central to the process of communication in the sciences? I may talk a lot more with colleagues in many different places now than I ever did, but when it comes to the substance on which we base our inferences and our new ideas, we have to go back to the literature. We have to go back to the publication from last month or the last decade. The scientific publication still remains the rock on which we build, and the thing that has changed most is the communication and the way we collaborate with each other.
MS. WALTHAM: Thank you. Shall we take those in turn? Jim O’Donnell, do you want to take the first one?
DR. O'DONNELL: Sure. I think the critical intellectual capital will be formed in the dialogue between the archivist, the information scientist on the one hand, and the working researcher on the other. If I had a druther, I would like the institutions and the not-for-profit sector to retain control of that process, that conversation, to hold on to the intellectual capital. Then I am willing to talk about who should do the dirty work and make the things actually operate.
I was brought up short once ten years ago, early in the days of BMCR, when a librarian turned associate provost asked me if we had done a good job of marketing to non-classicists and I said, not really. We pretty much know who all our readers are. She said, no, you don't. The one thing you know in a library is, there are lots of other people who have an interest in the stuff you are doing besides you, your six friends, and the folks you think you are doing this for. That really adds an order of magnitude of value that I think we always need. Getting the dirty work done adds another value, but it is a very different kind.
MS. WALTHAM: Thank you. Anne Wolpert?
MS. WOLPERT: Mine is easy. The answer is, of course. However, I will say that the challenge that confronts us now is the migration of the official version of the publication of record from print to electronic formats.
For 15 years, people have been asking me, when are you going to go all digital? The answer is, it depends, like law. But as a practical matter, I think we are in such a serious transition phase right now. I have no idea how long this is going to last, but we will know that we have successfully tipped into the electronic environment and learned how to archive material that is formatted digitally in reliable and sustainable ways when someone gets a Nobel Prize based entirely on an electronic publishing record. That is my standard.
DR. BERRY: I would just throw in one thing, too, partly to needle Marty Blume. It has sometimes been said that paper is much more permanent than any current electronic form, and we always have to face the dilemma of transferring last year's electronic archive into next year's electronic archive.
But I will point out that there was a period when Physical Review was printed on very high acid paper, and if you opened a mid-70s Physical Review now, the pages crack and come apart. It was because the APS was able to transfer those fragile paper versions to electronic that they have been able to save the records in PROLA.
MS. WOLPERT: But you could also put them in a refrigerator.
DR. BEASLEY: I'd like to make a remark. Steve Berry, you were commenting about, after this discussion with your colleague, you wanted to go back to the literature. But to illustrate the point that I was trying to make, when you log in with your wonderful search engine and you get those 45 papers, you may get weak.
There has got to be some way of distilling and grading and making access back however far you want to do. I couldn't resist -- since you stuck Marty Blume, I thought I would stick you.
MS. WALTHAM: Mike Keller, I think you are going to respond to Mac Beasley's question?
DR. KELLER: I'm going to respond to Mac Beasley and maybe ask a question.
MS. WALTHAM: I want to fit everyone's question in, so we are going to go fairly briefly, then on to Bob Bovenschulte. Would you like to say your name as you start out?
DR. KELLER: My name is Michael Keller from Stanford. I am going to try to hit the lob that Mac Beasley threw over my way. Just some figures to make some of this real for you.
The underwriters who work for Stanford and try to persuade us to spend a lot of money on insurance say that the collections that have been amassed there in the libraries, now amounting to about eight million books and probably 30 or so miles of archives and so forth are valued at $1.2 billion. It is one of the largest assets the university has, other than its own land.
In the last decade, the decade starting with 1993 when I arrived and when quite coincidentally, the World Wide Web became massively and widely available, we spent $73 million on capital projects, the most recent of which is to buy land to construct the first of what might be six or eight modules to store books and other physical artifacts, as well as perhaps to house a very large digital store off campus.
In the same period of time, I think, John, you would agree, we spent about five million dollars, earned mainly, almost entirely, from our publisher clients at HighWire, putting out about a million articles for them.
Don King, if you take my figures, that the cost of IT and e-publishing today is about five percent of the whole budget for publishing, you might come to see that $75 million for a million articles means about 35 bucks per article, and that means something like 1500 bucks for the whole creation of the article. I think your figures are confirmed.
As we work on this problem of non-physical artifacts, virtual artifacts, I think the solution lies, as it has in the past, great libraries and archives are not built by archivists for their own sake, they are built with the connivance and with the cooperation and with the jostling and poking from the scholars who most directly receive benefit from them.
As the scholars in science, technology and medicine very clearly see the advantages to some of the research possibilities that the big data sets and the big collections of electronic resources come to offer, then we will see more and more impetus and more and more money coming our way to provide them with the research materials, which then become artifacts for preservation over a long period of time.
We are preparing to do that, but everything that we have got is in embryonic phases, without exception.
MS. WALTHAM: Thank you very much.
DR. BOVENSCHULTE: Bob Bovenschulte of the American Chemical Society. This is for Anne Wolpert. I could conclude from your remarks that you might roughly divide all the deals that you have to do into three categories: good deals, acceptable deals, and bad deals. I am curious to understand better the interplay of forces within the university when you accept a bad deal.
MS. WOLPERT: There are different flavors of bad deals. Some of them have to do with the cost of the deal, and some of them have to do with the terms and conditions of use that come with the deal.
A bad deal for a university is a deal which over time consistently favors the needs and interests of one group within the university over others. I think part of the political risk that the scientific community runs right now, and where we are starting to see push-back in the university environment, is around the constant percentage increase in demand to support the scientific and technical literature out of a finite amount of money. So at the end of the day, someone on the campus gets shortchanged as a consequence of the need constantly to feed a set of growing expectations about the payout from university library budgets in support of scientific and technical literature.
So from the standpoint of the university, ultimately the groups that are being shortchanged, some of whom may be the classicists among us and the parts of the university that aren't -- you all know, you have been around universities long enough to know that disciplines cycle through favor. Around the turn of the last century it was mechanical engineering, and it was civil engineering and then it was physics, and right now it happens to be the biological sciences. But as a practical matter, sometime they will cycle out too, and something else will replace them.
So libraries struggle to maintain a balance on their campuses from the expenditure point of view between and among the disciplines. So the potential for long term damage is there, because you can't go back and buy a book that isn't available anymore. So if you haven't built it in your collection at the time you could, you can't.
The other worry that we have is about the terms that licenses provide to us about who can use material and under what conditions they can use material. That sometimes disadvantages parts of the campus. You can afford to license for a limited number of users, or only for one subset of your community, and then that disadvantages those who can't easily get to use it. Those things can't be networked, or they can only be networked within a particular subset of buildings on your campus.
So these are the kinds of complications that we really didn't have ten or 15 years ago, that we are now confronting on a regular basis.
MS. WALTHAM: Thank you. Gentleman on the left.
DR. NEAVILL: Gordon Neavill, Wayne State University. It seems to me that one of the problems in the digital environment is that the economic link between current and retrospective information is broken. In the print environment, almost all the information that libraries bought was bought because it was current information. It was then simply retained and became valuable retrospective information.
In the electronic environment, we pay once for current information, then we probably have to pay all over again, at a fairly high cost, to capture the same information for very important but a low level of retrospective uses. In the case of electronic databases, you may be making snapshots, but the need for the snapshots will be -- fewer people will be using the retrospective information than need the current information.
To some extent, these retrospective costs can be shared, but if there is a question here, it is, can electronic systems be designed to minimize the additional cost required to retain them for retrospective purposes.
MS. WOLPERT: There are actually two ways to think about that. One is that it is not quite correct to say that -- although when you only had one format to deal with, it is true that that one format served a variety of needs: the current information dissemination, the near-term research requirement and the long term archiving requirement.
What we are seeing in the digital environment is that people want to use the digital materials for the sake of convenience and productivity, but there are difficulties with that, in that the electronic material doesn't perfectly mirror what came out in print. So you heard maybe from Wendy, I can't remember who said it yesterday, that a third of the use of the Elsevier titles was for non-article material. So there is a lot of information in print publications that is not in electronic collections of journals.
So that is one issue. The other way to think about the electronic environment is that there are presumably costs associated with buying the retrospective collection on an annual basis. If you had a choice of buying one year's worth at a time in terms of your budget, you could do that and you could stop. But in the electronic environment, you have to buy the current year plus the archive in many instances over and over and over again.
Certainly that is the model for reference books. You might have been able to buy -- if you had limited funds, you might have been able to buy a scientific and technical reference book one year and then buy it on an every other year basis or every third year. If you buy it electronically, you pay the same price year after year after year. So it affects the economics of how you think about your collections, not just in terms of the current material, but in terms of how you manage the archive.
DR. O'DONNELL: I might add there, the retrospective collection has never been free. It costs a lot of money to keep it warm and dry and cool, and to re-shelve it.
We spend immense amounts of money on redundant collections. I am a classicist, remember, I'm not supposed to say this, but I think you have to recognize it is objectively true. Where off-site shelving is beginning to be done cooperatively among institutions, it certainly provides an alternative opportunity to think about just how much redundancy is worth paying for in terms of the use we get out of it.
Now, I will always be the aggressive one, insisting that there is a lot of use to be gotten out of it. But at the same time, I think we have to be rational about just how much we spend on the print archives. And it is a lot.
MS. WOLPERT: Which reminds me that we are talking about different kinds of money, too. If you build a building, you need capital funds for that. That is a big one-time effort, and then you coast through it at a relatively low cost. So although you amortize the cost of the materials that are stored there over the life of the building, in fact, for the university the economics work differently.
What you are talking about in the electronic environment is annual operating costs as opposed to being able to move some of those costs off into a capital budget, and manage them differently.
MS. WALTHAM: Thank you. This is probably our last question.
DR. ARRISON: My name is Tom Arrison. I'm with the National Academies. This idea of the dialogue between the scholars and the librarians is very interesting to me.
It made me think of the instructive example with MIT and OpenCourseWare, where there is some process which led to this vision which is fundable, getting money from the Mellon Foundations and Hewlett Foundations of the world to implement. I just would be interested in any of the panelists' reflections on whether the -- sometimes this dialogue and reflection on a campus can lead to a vision that can bring in the resources to implement it. I'd like to ask whether in your individual institutions that kind of dialogue is going on, is adequate now. Also, more broadly, what do you think needs to go on between institutions and among the scholarly communities, and how maybe this project in later stages might help address that.
MS. WALTHAM: Thank you. Who wants to start?
DR. BEASLEY: Well, in terms of dialogue, it depends a little bit what you mean. There certainly is a faculty committee that interacts with the administration, Mike and others, to deal with these questions. We hear reports and so on and so forth, and I have no doubt that it is a substantive discussion, given the people I know who are involved and whatnot.
But a wider discussion than that through the normal faculty senate committees and so forth I don't believe is widespread. That was a point I made earlier. The question is, is this an issue that is sufficiently important that one ought to try to have a wider discussion or not. I would argue that I think it is a candidate, anyway, because of its importance to the scholarly side of what we all do. But I don't think it would be fair to say in any institution that I know that that discussion is highly developed.
DR. O'DONNELL: I think I would say that the learned societies play a stronger role than the individual campuses do. They can aggregate more resources, and they can bring more intellectual firepower to bear on discipline-specific kinds of questions than happens in a university faculty, a library advisory committee, with one art historian, one botanist, one sociologist, and we're not sure what that other guy does.
MS. WOLPERT: But perhaps one of the points to take away from the discussion over the last day and a half is that there perhaps need to be vehicles to encourage that discussion, because there are conversations within institutions and groups of institutions, the CIC, Ivy Plus and so on, and there certainly are conversations in disciplines, and to a lesser degree among and between disciplines. But there is no easy way for the conversations to happen across the boundaries that we just described.
DR. BEASLEY: One other thing. You commented that universities have been able to get resources from foundations and whatnot to do some of these things. I think it would be more accurate to say, to try certain experiments. It is not the intended role of any foundation I dealt with as a dean to pay the continuing costs.
MS. WALTHAM: Thank you. One last final question.
DR. GREENBERG: I'd like to make just a brief comment, low tech comment.
MS. WALTHAM: Go ahead. Your name, please?
DR. GREENBERG: My name is Mike Greenberg with the Biological Bulletin. A comment that Malcolm Beasley made about the difficulty of reading physics articles. It is true of articles in any science. Writers write for their colleagues. But people in psycholinguistics have learned for a long time about how to write for readers. George Gopen wrote a really good article on the science of science writing in the late 1970s.
The reason that people like to read those short articles in Nature at the front end of the magazine, or in Science, is because they are written to be understood.
MS. WALTHAM: Thank you very much. What a great note to end on. Thank you very much to this excellent panel.
DR. SHORTLIFFE: I will draw things to a close. I thank you all for your attendance. Remind you that these materials will be on the Web site, the audio recordings and the slides, and also remind you that one of the charges that the steering committee has is to take the lessons of the last day and a half and try to crystallize them, and in particular to ask what are the key potential areas of study that the NRC might be focusing on for additional work here, to help move these issues forward.
There is a huge number of potential topics for study. I heard a plea here just in the last hour for disaggregating and focusing a bit on the ones that may be most amenable to that kind of work, but thoughts from any of you about where you think the most useful contributions of the NRC might be, or the NAS might be, in the way of new studies in this area, areas that haven't been looked at effectively by others perhaps, and where there is a need for work, we would welcome that advice. All the members of the steering committee are on that list of attendees with e-mail addresses, and you could send a note to any one of us.
So with that, thanks again to Paul Uhlir, to Julie Esanu, to Kevin Rowan, who have all been key players in making this happen. Thanks to all of you for joining us.
[Whereupon, the proceedings were concluded at 3:10 p.m.]
|