NOTE: This is an unedited verbatim transcript of the Symposium on Electronic Scientific, Technical, and Medical Publishing and its Implications prepared by CASET Associates and is not an official report of The National Academies or of the Committee on Science, Engineering, and Public Policy. Opinions and statements included in the transcript are solely those of the individual persons or participants at the symposium, and are not necessarily adopted or endorsed or verified as accurate by The National Academies.
******
THE NATIONAL ACADEMIES
COMMITTEE ON SCIENCE, ENGINEERING AND PUBLIC POLICY
SYMPOSIUM ON ELECTRONIC SCIENTIFIC, TECHNICAL AND
MEDICAL JOURNAL PUBLISHING AND ITS IMPLICATIONS
May 20, 2003
The National Academies
2100 C Street, N.W.
Washington, D.C.
Proceedings By:
CASET Associates, Ltd.
10201 Lee Highway, Suite 160
Fairfax, Virginia 22030
(703)352-0091
* * *
PROCEEDINGS (8:32 a.m.)
DR. SHORTCLIFFE: In the interest of starting on time, I welcome you to the second day of the symposium.
Before I introduce this morning's first panel, on the chance that some of you have to drift off before the end of the day, I thought I would say a few words right now on behalf of the steering committee. I assume I express all of your gratitude to those people here on the staff of the National Academy of Sciences who have played a key role in putting this together, serving as members of our committee to a large extent and knowing this community well: Paul Uhlir, Julie Esanu, and Kevin Rowan. Thank you very much for the staff support for this meeting.
Without further ado, we will start our future-oriented sessions for today, the first one chaired by Dan Atkins. Dan, do you want to introduce your panelists and take it from here? Thanks.
Agenda Item: Panel 4: What is Publishing in the Future?
DR. ATKINS: Good morning. And good morning to those in Web cast-land. I have been spending a couple of months on sabbatical in Berkeley. I am particularly appreciative of any of you who are in California attending this through Web cast this morning.
We have a very distinguished and interesting panel today. Paul Resnick from the University of Michigan, Richard Luce from Los Alamos National Lab, and Hal Abelson, who will help us think about and discuss what is publishing in the future.
There are short bios of the speakers in your packet and I won't repeat what is there. I just want to emphasize a couple of aspects of the work and focus of our panelists that I think are particularly relevant to their presentation today.
Paul Resnick conducts a wide range of research in the general area of enabling productive social relationships by use of information and communication technology. During his spare time here, he was out studying the emergent process of ride sharing in the Virginia suburbs, for example, and thinking I'm sure about ICT applications to that process. He is widely recognized as a pioneer in the area of ICT based recommender and reputation systems, and will focus on that in potential relevance to publishing today.
Richard Luce is the leader of a wide array of initiatives at Los Alamos that fall under the general rubric of the Library Without Walls, and is a pioneer in the Open Archives Initiative.
Hal Abelson is a founding director of both the Free Software Foundation and Creative Commons -- you heard a reference to Creative Commons yesterday -- and a primary instigator of both the OpenCourseWare and the DSpace projects at MIT.
Yesterday there were a couple of references to the report of the blue ribbon panel that I chaired over the last year and a half. Here is a URL that you can go to to get PDF versions of the report. If you forget the long URL, and for those of you in Web cast-land, it would be simpler to send me an e-mail, atkins@umich.edu. I can also probably get NSF to send you a glossy-cover, hard-copy version of the report, if you want that artifact.
To set the stage for today's talk, we are all aware that the digital evolution is disaggregating the traditional processes of many knowledge-intensive activities, but in particular the process of publishing, or scholarly communication, to use a broader term, and is offering up alternatives in both how the various stages of these processes are conducted and who does them. So the functions of repository or metadata creation or credentialing or review or whatever, and long-term stewardship, can be separated, disaggregated, and different players than have traditionally carried out these tasks can in theory perform them.
It is also changing the process by which knowledge is created, by which discovery takes place. This is most true, and the pioneering efforts in this arena are within the STM field. That is the central theme and the central study.
The creation of this report involved extensive testimony with leaders of essentially the entire spectrum of science communities funded by the NSF. We clearly documented an emergent stretched vision and enhanced aspiration of many science communities in the use of ICT to build quite comprehensive environments based on what is now being called cyberinfrastructure, things that go by names of collaboratories or grid communities, which are functionally complete, in the sense that all of the people, the data, the information, the instruments that one needs to do that particular activity in that particular scientific community of practice are available through the network and online.
So indeed, a growing number of research communities are now creating ICT-based environments or cyberinfrastructure-based environments that link together people, data, information, computational tools, services, and instruments in ways that are functionally complete and that relax constraints of distance and time, where distance can be geographical and/or organizational, cutting across organizational boundaries, and/or multidisciplinary.
There is a general trend we identified towards more interdisciplinary work and broader collaborations in many fields.
This is one of the figures from the report, where we have got the base of storage, computation, and communication that is continuing to accelerate and rocket past at exponential rates -- or, as Jim Duderstadt said yesterday, in some places hyper-exponential. There is this networking, operating systems, and middleware layer. Then on top of that, the aspirations and recommendations of this report are to try to create as common as possible a level of high-performance computational services; data, information, and knowledge management services; observation, measurement, and fabrication services; and visualization and collaboration services. Then to provide the wherewithal so that this infrastructure layer could be deployed and customized on behalf of specific communities or interdisciplinary communities.
So some implications for publishing -- and there is an alternate expression of the big idea we are trying to get at today in the panel description. I have deliberately chosen another, because we are kind of in the horseless carriage stage of all of this, and the nomenclature is not quite there. But publications can now exist in many intermediate forms. We are moving to the possibility of more of a continuous flow model rather than a discrete batch model.
There were actually phrases used yesterday by various people that are in the vicinity of the idea we are trying to talk about. Wendy Lougee, I believe it was, said we are experiencing a shift from publication as product to publication as process. Jim Duderstadt made a reference to parallel flows.
One of the questioners later in the day came up and said, we are talking about largely today automating better, faster, cheaper what we have done in the past with ICT. We really ought to start thinking about using it to do new things new ways. He pointed out that we are moving into a model of continuous improvement in the digital realm, and also opined that that requires open access.
So raw data, process data, actual replays of experiments and deliberations that are mediated through a collaboratory can actually be captured, replayed, re-experienced, working reports, pre-print manuscripts, credential or branded documents, or you can even imagine things post peer review moving to some kind of hyper ranked state, or actually undergoing the privilege of being annotated by the leaders of the field. Those annotated versions of the documents now become available, and so forth.
The work products can be available at varying times to varying people with varying terms and conditions. So you again have a whole host of customized options that can be available.
Publications need not necessarily be pre-credentialed before publication on the Net. Their use on the Net can be credentialing. This is a way of talking about this that I owe to George Furnas, my colleague at the School of Information. In theory, every encounter with the document may be an opportunity to rank it in some way, to create some kind of cumulative sense of its impact or importance. You could have alternate credentialing entities or methods, and you could pick your favorite.
The best bubbles up in use through the document's social life, in the sense of John Seely Brown's The Social Life of Information. Perhaps open source models could be applied here. That is beyond the scope of what we are going to talk much about today, but I think it is worth pursuing.
The raw ingredients, the data, the computational models, the instruments, the records of deliberation, could be online and accessible by others and conceivably used to validate or reproduce at a deeper level than has traditionally been possible.
Finally, the primary source data can be made available with at least a minimum set of metadata and terms and conditions. Then third parties, particularly in an open access -- and this is partly a part of the Open Archives Initiative -- third parties can then add value by harvesting, enriching, federating, linking, and mining selected content from such collections of open archives.
So in this session, our goal is to try to describe, illuminate and inform discussion about some of these emerging technologies, the related social processes, some specific pilot projects, and the challenges and opportunities that may provide the basis for these kinds of future publishing processes. I put that in quotes, because we may someday not actually think about it explicitly as a publishing process, but as more holistically integrated into the knowledge creation process.
With that then, I'd like to introduce our first speaker, Paul Resnick, from the School of Information at the University of Michigan. Paul?
DR. RESNICK: Thank you, Dan. When people talked about changing -- yesterday what I heard when people thought about changing the current publication process, I heard concerns that things would descend into chaos, that no one would know what documents were worth reading, because you wouldn't have the current peer review process.
I'd like to turn that idea on its head. Instead of going without evaluation, we will have much more -- there is a potential to have much more evaluation than we currently have in the peer review process. We can look at what is happening outside of the scientific publication realm at other things going on on the Internet to give us some clues about where this could go.
Today's publication cycle, there are reputations for publication venues. Certain journals have a better reputation than others, certain academic presses have strong reputations. There are a few designated reviewers for each article, and those reviews gate-keep the publication. It either gets in, or it doesn't get into this prestigious publication; it is a zero or one decision, binary. Then afterwards, we do have citations years later as a behavioral metric of how influential the document was.
If we look out at some other realms on the Internet, I think we can see some trends, and maybe we should see how they apply to scientific publication, where there might be reputation for individual documents, and also for the reviewers. There would be lots of public feedback, both before and after whatever we mark as the official publication time, and we would have lots of behavioral indicators, not just the citation counts.
So let's think about some examples of publicly visible feedback. This is a watch, not unlike the one that I eventually bought, that was available for sale on eBay. It was selling for $48; I actually ended up paying $99 for this one. It was sold by Fastsell 2, which made me a little suspicious, but that number 793 tells you that Fastsell 2 has actually done quite a few sales, and 793 people or a few more than that have left comments, almost all of them quite positive. But if you look down near the bottom, there was the one complaint: I like the watch but the face is scratched. If you are going to sell a used watch, you should specify it.
Some people would say a few of these complaints -- and it really is only a few -- maybe I should go ahead and buy it anyway; others of us who are more concerned or just don't want to deal with the hassle went and bought it retail.
The same kind of thing, not just for evaluating who you can choose to trust on eBay, has been applied to conversations. Here is one conversational site, it is a political deliberation site, and the stories get rated by people. If you are familiar with Slashdot, the same kind of thing happens with comments there. But here, they have arranged these stories based on the expressed interest of previous readers. So the one at the top has an expressed relevance rating of 201. The one at the bottom has a relevance rating of 98. Presumably there are some more further down that have gotten even lower ratings.
Amazon.com, we're getting closer to the publishing world. This I'm sure all of you are familiar with. There are reviews at Amazon, both text reviews and numeric ratings that any reader can put in. Many of us find this quite helpful for books. We don't quite have it for individual articles in scientific publishing yet, but we do have this for books.
Even closer to the scientific publishing world, this is a site called Merlot, which collects teaching resources. They have a whole peer review process. You can see here, I have shown a couple of resources, one called DNA From the Beginning, another called Physlets. For each of these they actually have a peer review process before they will include it in the collection at all. You can see, the peer review is over at the right, but even after it is published, members can put in comments.
Typically, teachers have tried using it, and they say this is what happened in my classroom and so on. The member comments don't always exactly agree with the peer review comments. You can see for the one at the bottom that the average rating for the member comments is a little bit lower than the rating that the peer reviews gave it.
So that is a sense of what is happening with subjective feedback that people can give and make public about things that they are using.
There are also behavioral indicators. You don't ask people what did you think of this; you watch what they do with it. That is like the citation count. Amazon, in addition to those customer reviews, they have the sales rank, in this case, 3,361. I haven't written a book, so I haven't had the pleasure of doing this, but the people I have talked to who have written books find it quite addictive to keep checking back and seeing what their sales rank is.
Here is Netscan. This one is going to require a little bit of explanation. Netscan is a project that Marc Smith at Microsoft Research has been doing to collect behavioral metrics on Usenet newsgroups, the Net news. You've got all kinds of metrics about the groups.
This is one that is organized by author. I have gone to a particular newsgroup here, and I have gotten a list of the authors from that newsgroup in the last 30 days. I picked the third one down and expanded it to see -- if you look at row number three or any of the rows, you see that you get metrics about this user, how many days out of the last 31 were they active in this newsgroup, did they post something; what is the total number of posts that they have done, how many different threads did they touch, how many new threads did they initiate in the conversation, and various other metrics. Then you can go get more specific things about particular threads that they have been involved in, and you can see that in some cases certain authors tend to dominate threads and other authors manage to say a little something in every different thread, and you find out something about the users this way.
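The per-author metrics just described (days active, total posts, threads touched, threads initiated) can be sketched in a few lines. This is only an illustration under stated assumptions, not Netscan's actual implementation: the `Post` record, the `author_metrics` function, and the sample postings are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one newsgroup posting."""
    author: str
    thread_id: str
    day: date
    starts_thread: bool  # True if this post opened the thread


def author_metrics(posts):
    """Return {author: metrics dict} for a collection of posts."""
    days = defaultdict(set)       # distinct days on which each author posted
    threads = defaultdict(set)    # distinct threads each author posted in
    totals = defaultdict(int)     # total number of posts per author
    initiated = defaultdict(int)  # threads each author started
    for p in posts:
        days[p.author].add(p.day)
        threads[p.author].add(p.thread_id)
        totals[p.author] += 1
        if p.starts_thread:
            initiated[p.author] += 1
    return {
        a: {
            "days_active": len(days[a]),
            "posts": totals[a],
            "threads_touched": len(threads[a]),
            "threads_initiated": initiated[a],
        }
        for a in totals
    }


# Made-up sample data, for illustration only.
sample = [
    Post("alice", "t1", date(2003, 5, 1), True),
    Post("bob", "t1", date(2003, 5, 1), False),
    Post("alice", "t1", date(2003, 5, 2), False),
    Post("alice", "t2", date(2003, 5, 2), True),
]
metrics = author_metrics(sample)
```

An author who dominates a few threads would show many posts but few threads touched; one who says a little in every thread would show the reverse.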
Google uses behavioral metrics of links in their PageRank algorithm. This is one that many of you have gone to check, how am I ranked on Google on various search strings. For reputation systems, the stuff I care about is right up there at the top. It turns out if you search on reputation or reputation system, I don't do so well.
But they are not just doing a text match; they are also taking into account how many links are there to this page from other Web pages, and they are even weighting it a little bit by the rank of the other pages that are linking into my pages. So this is a ranking system that takes into account this behavioral metric of who is linking to whom.
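The idea of counting incoming links and weighting each one by the rank of the page it comes from can be sketched as a simple iterative computation. This is a minimal textbook-style sketch, not Google's production algorithm: the function name `rank_pages`, the damping value of 0.85, the iteration count, and the four-page example web are all assumptions made for illustration.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Rank pages by incoming links, weighted by the rank of the linker.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page keeps a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = rank_pages(web)
```

In this toy web, page c ranks highest because three pages link to it, and page a outranks b and d because its single incoming link comes from the highly ranked c.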
This is SSRN. It was mentioned yesterday. This is the download count as a behavioral metric. This is the all-time top ten downloads, 30,000 for the top one, but of course they also have them in lots of different categories, so that more of us have a chance to be a winner.
You are going to hear some more about some behavioral metrics that Richard Luce has been looking at in the next talk.
This is to give you a flavor of some of the things that I think we ought to be thinking about in the credentialing processes for scientific publication in the future. But there are some issues to deal with. Some of these are active areas of research for people who are working on recommender and reputation systems.
An obvious one is gaming the system. You make a bunch of Web pages that all point to yours so that Google will rank yours higher. In fact, there is a whole cottage industry. You can hire consultants who will help you get higher in the Google rankings.
It is a little harder to do this with the Amazon sales rank. It requires you to actually acquire some books. But people do try to figure out, when should I buy the books and should I concentrate it and buy them all at the same time, so that I temporarily get up there and get noticed, or should I spread out. You try to figure out what their scoring metric is and game the system.
You would really want to think about it as you are designing these metrics. The ideal metric would be strategy-proof: the optimal behavior for the participants is just to do their normal thing, and they can't easily game the system. It is not always so easy to design the metrics in that way.
Another problem is eliciting early evaluations. In these systems where you have widespread sharing of evaluations, there really is an advantage to going second, let somebody else figure out whether this is a good article to read or not. And of course, if we all try to go second, then there would be no one who goes first.
Another problem can be herding, where the future evaluators don't really reveal what they thought of the document, but they are somehow overly influenced by what the previous evaluators thought. They just go along with the herd.
There are some interesting ideas that potentially would help with the herding problem, for evaluating evaluators, where you might reward evaluators for saying something that goes against the previous wisdom, but which subsequent evaluators agree with. That would be the person who finds the diamond in the rough, would get special rewards; the person who just gives random reviews would get noticed and would get a bad rating.
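One way to make the idea of evaluating evaluators concrete is to score each rating by how far it departed from the ratings before it and how closely the ratings after it agreed. The rule below is a hypothetical sketch for illustration, not an established method from the recommender-systems literature; the function name `evaluator_scores`, the scoring formula, and the sample ratings are all assumptions.

```python
from statistics import mean


def evaluator_scores(ratings):
    """Score evaluators of one document.

    `ratings` is a list of (evaluator, score) pairs in submission order.
    Credit = (distance from the average of earlier ratings)
           - (distance from the average of later ratings).
    Positive credit means the evaluator broke with prior opinion and was
    later borne out; negative credit means later evaluators disagreed.
    """
    credit = {}
    for i, (who, score) in enumerate(ratings):
        earlier = [s for _, s in ratings[:i]]
        later = [s for _, s in ratings[i + 1:]]
        if not earlier or not later:
            continue  # first and last evaluators cannot be scored this way
        departure = abs(score - mean(earlier))
        disagreement = abs(score - mean(later))
        credit[who] = departure - disagreement
    return credit


# Made-up ratings on a 1-5 scale: "c" is the first to rate the paper highly.
history = [("a", 2), ("b", 2), ("c", 5), ("d", 5), ("e", 5)]
credit = evaluator_scores(history)
```

Here "c", who went against the early low ratings and was confirmed by later reviewers, earns the most credit, while "b", who simply matched the first rating, ends up with negative credit. A real scheme would also need to reward the first evaluator, which this sketch leaves unscored.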
Also, this is going to require going back and revisiting some of the decisions that we have made about anonymity and accountability in review processes, single blind, double blind, not blind at all. I think we are going to end up for different purposes wanting different versions of that.
I'd like to suggest a few small experiments, and then I will conclude by saying where we might go in the bigger picture. I think for journal Web sites, some of these are more radical than others, but you might publish the reviewer comments. I think those could be of interest. I think they would cause the reviewers to be better, if they knew their comments were going to be published, even without their names attached. You might think about publishing the reviews for the rejected articles as well. I think you would get fewer really bad submissions if people knew that it wasn't free, and that they could potentially be hurting their reputation by having the reviews of that article up there.
Then after publication, we are all running Web sites where people can at least get the abstracts, so how about letting people put comments in that would be publicly visible.
Some other experiments are to try to gather more of these metrics. We are starting to see things like this with CiteSeer in the computer science area, where they are measuring citations in real time, but also using the link data, using the download data, in those places where people are actually reading online getting the reading data, getting how many times this is being assigned in courses, all kinds of behavioral metrics like that.
Start to think about experiments in evaluating the evaluators. My guess is that the best place for this might be in some of the conference proceedings, where you have at least in the ACM world a number of evaluators for an individual article, so you could actually -- I know when I do those reviews, I always go look and see, was my review somewhere in agreement with the other reviewers, or was I a big outlier, and if I was a big outlier, was I right or not. You might actually make that a more explicit thing for evaluating evaluators.
The question is, is this going to be the future of scientific publishing. I think we ought to at least consider it as a possibility, that the author would be the only gatekeeper on publishing, that there would be credit for authors based on reader feedback and these behavioral metrics. That credit would really go to individual articles and individual authors. You wouldn't have to only do this through the indirect means of, did they publish in a very high prestige publication. You really might be able to, when someone comes up for tenure, run various computations, hopefully better than the ones that are done now of counting the number of publications, and actually get various scores on how influential somebody was.
This last bit is credit for evaluating early, often and well. I have been thinking about -- and this came up yesterday in one of the comments, that the people who are running reviews are having trouble getting people to do the reviews. I think we are not valuing that sufficiently in the system. Having some metric that didn't just say, yes, you were a reviewer or, yes, you were an associate editor at some journal, but that actually said, this was a really valuable evaluator. We might start thinking about that as a contribution to research, rather than in the service line.
In the promotion and tenure reviews for academics, we always talk teaching, research and service. Serving on editorial boards usually goes under service. But if we think about this knowledge generation and dissemination process in the scientific community, the people who are doing this evaluation and commentary might really be thought of as contributing to the growth of knowledge. If we could get some metrics on how much they are contributing in that way, we might think about that as a research contribution rather than just a service contribution.
Thank you.
DR. ATKINS: Thank you very much, Paul Resnick. Now Richard Luce from Los Alamos National Lab.
DR. LUCE: Good morning. I was given the task by Dan to talk about pre-print servers and extension to other fields. The first thing that hit me was, pre-prints are something that is a well-known, well-understood concept in the physics community, and in other communities it is met with either a puzzled gaze or some other sort of reaction.
I would like to talk today a little bit about the physics community and pre-prints as kind of a community specific response, where we have been, some enabling infrastructure for where we are today, and then I'll look at a possible tomorrow in terms of recommendation systems for where we may be going.
To start, let me just put out a couple of definitions to keep things clear. Pre-prints clearly has this buyer beware connotation to it in the physics community. It is an informal non-peer review feedback that is weighted very, very differently in the community than a formal refereed report. It is basically the idea of, get something out to colleagues -- one understands the concept of that -- get some feedback, if any may come back, and I'll think about whether or not I want to publish that later.
E-prints on the other hand slowly seem to be accumulating this notion of authors depositing papers, or drafts of papers, in some kind of an archive in order to speed up the communication process, thereby giving authors essentially control over distribution of their work, and deferring the decisions related to formal publication downstream.
So with that distinction, let me just start by going back in time a little bit, back to 1991, with the arXiv, or xxx, that Paul Ginsparg created at Los Alamos.
That archive today has about 28 or so database archives or fields and sub-fields, 244,000 papers. Fundamentally, it has succeeded in large part because Paul is a physicist. He understood well as a high-energy physicist how that community worked, what its needs were. His notion was to take and streamline that communication process relative to pre-prints.
It does give the author control over both distribution and access considerations related to that kind of communication. More recently over the last decade, it has certainly spread into other fields, mathematics, materials, nonlinear sciences, computation, et cetera.
So first of all, let me dispel the notion that this is a phenomenon only in high-energy physics or only in the physics community, and therefore can't work anywhere else. That is actually not the case.
It clearly has increased communication, certainly in the areas that it covers in physics. It is the dominant mode of registering, here is when my idea came out. It may be published six months later, whenever, but you can look back in time and look at that stamp, in terms of the system there.
Cost is very, very low. Paul likes to quote costs that seem to me a little on the low side, but certainly very, very low cost, and consequently wide acceptance.
The driver in the community clearly is speed, how do we make things move faster. It is my belief in talking with and interacting with a wide variety of both society and commercial publishers that it was in fact these kinds of examples that get out on the edge that caused the rest of the community to begin to respond and say, perhaps we ought to move in that direction, perhaps we ought to try to hold on to some turf that we have, and maybe we ought to move our model to electronic and so forth.
Clearly we see a trend here in terms of a continuing increase in terms of submissions. I think it is significant to note, in 1995-96, the American Physical Society began to accept pre-print postings, and later began to say we'll make a link back to those. So this was the beginning of a real formalized recognition that there was a role for this, call it bottom layer or first tier, in terms of information, and that we could have a two-tier structure and start to link those things together.
The question is always asked, so is there any value. These things aren't peer reviewed, so is it just a bunch of junk in there. If one does an analysis over a period of time of the quality of the submissions, what you see is a field specific track record in terms of what actually gets published.
High-energy physics theory, about 73 percent of the papers in the archive turn out to be published. In condensed matter, somewhere around a third or so. So it is fairly field specific, but it is an indicator that it is not just things that have no future or have no role in terms of the formal system itself.
What lessons can we learn out of that? The issue of timeliness certainly. A few passionate people can make a difference. This system for a decade while it was at Los Alamos was typically run by on the order of three, four, five people, sometimes doing far too many hours, but very, very passionate. This real sense of, this is really going to change the world. So that small number of people, relatively small amount of dollars, on the order of about a half a million or so in the year, became a very, very dominant thread in terms of the community itself.
Most importantly, I think the lesson is that it addresses sociology of a community of common interests, which is why the system worked for that particular community.
Scholarly communication however is a very complex ecosystem. Clearly, all fields are not the same. The sociology, the behavior, the traditions differ from field to field. Consequently, this solution is not the correct or only solution, nor could it be expected to fit in a variety of other fields.
The one size fits all argument in some cases doesn't apply here. I think the issue or the lesson ought to be, one needs to really understand the community behavior, the traditions, how that community works, and then look for models that meet those kinds of needs and requirements.
There have been spinoffs into other fields. Examples of other e-print systems: Cogprints, out of the University of Southampton, is certainly quite well known in cognitive science; NCSTRL, as an early effort to get computer science papers harvested together and then start to build a federated collection of computer science technical reports. Certainly NDLTD, with Ed Fox at Virginia Tech, and the idea of being able to scoop up a set of theses and dissertations and so forth.
The NASA system, the NASA Technical Reports Server, was an early pioneer in terms of trying to bring together a collection of federal reports and make those available, both in terms of metadata and the full text. PubMed Central and E-Biomed, certainly very, very visible and well known in the life sciences community. Living Reviews, a little bit different model, out of gravitational physics at one of the Max Planck institutes in Germany. This is the notion of essentially creating a review that gets updated by the author over time. So rather than going back to 1995 and reading a review that is static and wondering what is happening to the field, authors who publish in Living Reviews commit to a process to try to keep the material they put into that online publication up to date.
And we have seen spinoffs in areas like economics and so forth.
If you count the number of Open Archives Initiative-compliant servers, you get around 100 or so people who say we have got some kind of an e-print system, we are going to use a standardized protocol, and we are going to allow people to come in and at least harvest some segment of what we collect here, harvest at least the metadata of that segment.
Unfortunately, we only count about a dozen or so service providers today. If we look at the problem space, it is an enabling infrastructure. Clearly the Open Archives Initiative was not meant to be the end-all, be-all fix to the system. It was really meant to say we need to look for a solution that allows a discipline-specific e-print archive to be able to talk to or communicate with other systems, so that one has the opportunity to go in and look at a pool of things. So we have a variety on the bottom level, a variety of different representations of different systems, and the problem is, how do we get access to this.
The protocol specifies the method by which things can be harvested. We are just reaching the point where we are starting to see people talk about, that is fine, now that we have this data, what kinds of interesting services can we put on top of that. That development has been slower than I thought it might be, but beginning to take off with a variety of different systems.
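To make the harvesting step concrete, here is a minimal sketch in Python of what a service provider might do: build a ListRecords request in the Open Archives Initiative protocol and pull identifiers and titles out of the Dublin Core metadata that comes back. The base URL, the sample record, and the function names are hypothetical, chosen only for illustration. A real harvester would also fetch the response over the network and follow resumption tokens, which this sketch leaves out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH responses and by unqualified Dublin Core.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository (base_url is hypothetical)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response body."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        ident = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=OAI_NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=OAI_NS)
        pairs.append((ident, title))
    return pairs


# A made-up response, standing in for what a repository would send back.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical e-print</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(extract_titles(SAMPLE))
```

The point of the sketch is the division of labor: the archive only has to expose metadata in a standard form, and anything more interesting is built by whoever harvests it.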
One example, CiteSeer. As Paul Resnick mentioned, you can see -- this happens to be a submission on arXiv, and you can see both the citations and what is happening over time. Again, real time. So this begins to hint at the kinds of things that in an open environment people might do at the service level on top of an OAI collection.
So what does this mean beyond physics, and what new efforts can we see? I want to note that there has been incredible opposition. I remember going to conferences back in 1992-93. High-energy physics is limited to that community. The next thing was, it will never work outside of physics. It will never replace online journals. But the system continues to bubble up and bubble up.
There is very powerful opposition coming from very traditional parties who have used the journal publishing either as a cash cow or secondary providers who see their secondary databases as essentially a birthright. I think unfortunately, we saw a lot of political pressure, which created the demise of PubScience because it began to threaten those kinds of interests, and there was a lot of talk about, we need to go after some other targets now. I suspect that may happen or may not happen, but essentially kind of a political track to hold onto the economic value proposition, or the economic position, I should say, that a number of companies have.
MIT DSpace, which I think we will hear more about, the European effort, Figaro, are some examples of models that can either co-exist with the current system or help the current system evolve into something better able to meet the needs of researchers.
I think it is very interesting to me to listen to this dialogue over a period of about a decade, and people talk about this system won't do this and won't do that, and there is very little discussion about, what does it do for the end user, and how do we evolve the system from the perspective of the end user as opposed to all the other players in the value chain.
So sometimes in this complex chain, I think we lose sight of what to me is the most important thing, which is what it is we offer the end user in terms of what we are doing with the system.
If we think about the peer review system, I'm not going to take the position that peer review shouldn't be done. Clearly there is an issue related to how do you -- in the new world, how is something like peer review or how is quality assessed.
In the new world, we have this problem of quality. Rather than having a snapshot that someone takes, and sending that snapshot out for people to take a look at and make some judgment about, what we have instead is a movie stream or a video stream. So we have a very dynamic environment.
One can think about the issue where we have somebody reading something in Physical Review Letters who decides to do an experiment. Out of that experiment comes some simulation code. It is put on a server. That server is hit by a number of other institutions. People in those institutions decide to modify the code, re-run the experiment. Pretty soon you have a chain of people who are interacting with a phenomenon, trying to understand what is going on. At a given point in time, you have a different understanding of what that phenomenon looks like. You also clearly have a set of players who are responsible for in a sense the output of those ideas and how they get communicated. So rather than a snapshot and an article, what you have is this video stream, so it gets very, very difficult to both respond quickly and decide who is it and what is it that we are going to make some decisions about. That is what I am calling compound evolving documents.
So I think the real question that we are struggling with today in part related to the peer review question is, what is influence, and how do we detect influence. I think there is a variety of methods that one needs to look at, essentially a composite. Today we use citations as the sole indicator of influence. That is an author-derived statement about what is important, what has influenced my work.
I want to suggest that we might look at, at least a complementary path, which is the notion of reader behavior related to determining influence. I want to posit the idea that digital libraries or service providers can provide analytical tools to generate new metrics based on user behavior, which complements or may even surpass citation ranking and things like impact factors.
What is the problem with impact factors? To my mind, it is the lazy person's notion of how to figure out what is important in terms of journal ranking. It is very convenient for publishers to say my journal is ranked such-and-such in terms of impact factors. It is relatively easy for a librarian to justify this is why we buy this title instead of that title.
The problem is that the citation is only an indicator of influence. There are many reasons that people might cite a paper. I want to show that I have read the literature in a field. I want to disagree with somebody and prove them wrong. I've got a friend that I've got to make sure he gets some visibility and enhances reputation, or there is generally a good idea out there that I want to credit.
So impact factors are widely used to rank and evaluate journals. They are often used inappropriately, in my view. Then there is a whole field of bibliometrics, which tries to look at a more complex environment of authors, citations, journal citations and the subjects that are covered. In my view, still a fairly emerging field, but one that we are going to see take up more and more of a drumbeat in this area.
What would a multidimensional model look like, in terms of thinking about something in addition to citations? We have also thought a little bit about the Google model and said, it has some limitations also.
An ideal system might have the following things. You might look at citations and factor that in. You might look at co-citations, determine the nearness or proximity indicator. You might look at the semantics or the content and the meaning of the content in articles and see how they are related.
Finally, you might look at user behavior in terms of traversal paths. By traversal paths, I mean the following. I start off in the morning, I read a report say in the laboratory, a Los Alamos report. From there, I see a link. It refers me to an article in Science. From there I click on something and it takes me to JBC.
Statistically there is probably some relationship then, if those things are done in relatively short periods of time, between the government document I looked at and that JBC article that I have read. If one agrees with that premise and statistically starts to look at how frequently are those kinds of things connected together in a session, in a reader session, and how frequently are those occurring within a community, one starts to see a behavior pattern that I think can suggest things.
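The traversal-path idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system in use at Los Alamos: it assumes each reader session is available as a list of (timestamp in seconds, document identifier) pairs, and it counts how many sessions contain a given pair of documents read within a chosen time window. The session data, the 30-minute window, and the function name are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(sessions, window_seconds=1800):
    """Count, per pair of documents, the sessions in which both were read
    within window_seconds of each other. Each pair counts once per session."""
    counts = Counter()
    for events in sessions:
        ordered = sorted(events)  # sort by timestamp, then by document id
        pairs_in_session = set()
        for (t1, d1), (t2, d2) in combinations(ordered, 2):
            if d1 != d2 and t2 - t1 <= window_seconds:
                pairs_in_session.add(tuple(sorted((d1, d2))))
        counts.update(pairs_in_session)
    return counts


# Made-up sessions: a lab report, then an article in Science, then one in JBC.
SESSIONS = [
    [(0, "LANL-report"), (300, "Science-article"), (900, "JBC-article")],
    [(0, "LANL-report"), (600, "JBC-article")],
    [(0, "LANL-report"), (7200, "JBC-article")],  # two hours apart: not counted
]

if __name__ == "__main__":
    for pair, n in co_access_counts(SESSIONS).most_common():
        print(pair, n)
```

Pairs that recur across many sessions in a community are the candidates for a relationship; a real metric would also need to normalize for how often each document is read at all.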
So in terms of social navigation, we are currently experimenting with a system that allows us to derive metrics at Los Alamos. We are able to do this because about 95 percent of what we have is electronic only. We can detect and determine community-specific research trends, and we can look at where those trends differ from the ISI impact factors.
So out of this, I believe that we can develop some formal and informal hybrids, looking both at the e-print bottom layer and a higher layer of things that finally get published. We have got the issue of how do you deal with trans-disciplinary science, where things start to collide together, and we don't have a good answer for that today.
Finally, we have got the problem of long term curation. To finish, I want to put that problem in context; there are three pieces to it. You have got whatever it is that is published or out in the literature, and that is the thing that when people talk about preservation, they think about. But secondly, you've got the issue of the relationship, let's say a rich linking environment related to that. So that is the set of things that you would like to preserve and be able to represent over time as well.
But thirdly, you have the whole question of the patterns of behavior related to those things, and that is something you would like to preserve and collect over time, and make available in terms of a curation perspective as well.
I think I am getting the hook, so with that, I thank you.
DR. ATKINS: Our final speaker is Hal Abelson from MIT.
DR. ABELSON: I was sitting yesterday and this morning, listening to Ted Shortliffe and Bruce Alberts and Dan this morning talking about how this panel was supposed to be about the future, this wonderful cyberinfrastructure future. I was reminded of William Gibson, the outstanding cyberpunk writer who 20 years ago gave us the word cyberspace. He said he doesn't like to write about the future, because for most people, the present is already quite terrifying enough.
It is in that spirit that I want to talk about the present. I hope we are all here agreeing that what we are trying to do in this publication process is to promote the progress of science. And of course, what is happening, as was already said this morning is, the elements of that publication are starting potentially to disaggregate. We heard Paul Resnick and Rick Luce talk about some technologies for review, but there are lots and lots of other things where technology can come in and allow different kinds of players. And of course what is happening in this cyberinfrastructure is, we are now all engaged in this cyber videogame called, dis-intermediate thy neighbor.
The main thing that I would like to say is that in this present, the action you should watch for is not new technology, because it is old technology. The action you should watch for is new players coming in and finding institutional reasons related to their other primary missions to participate in this new game of dis-intermediate thy neighbor. In particular, I want to ask the question, do universities have institutional roles to play here other than what they have been so far, which is the place where the authors are. So do universities have a reason in their institutional missions to start participating in this.
If you look at MIT's mission statement, what you will see, and I'm sure lots of other places are like this, is that MIT is committed not only to generating, but also to disseminating and to preserving knowledge.
How does a place like MIT or any university think about its mission to disseminate and preserve as well as to generate knowledge? One way is getting pretty famous. About two years ago -- and here you see a statement from MIT's president's report, that says MIT has made an institutional commitment to take the primary materials that we use for our students in classes, that we create for our students in classes, and put those up on the Web for free open access by anyone.
The reason we did that is not that we were overcome by some fit of altruism; it is that we decided that given the way the world is going, it would be better for MIT in terms of fulfilling our primary mission to educate our students, if MIT and Stanford and Berkeley and all research universities and all universities put their primary educational material on the Web. It would be a better world for us.
Here is the OpenCourseWare Web site which you can go to now. It is currently just a prototype and has 50 courses up. There is a group at MIT which is madly trying to get up the first 500 courses by September. They are on a timeline to get up all MIT courses by -- I think it says in the small print there 2007. You can watch and see how we are doing, but the point is, MIT has made this as an institutional commitment.
When we started this, people said, sure, there are all sorts of courses that got Web sites up and people can access them. But in this audience, I need hardly emphasize the difference between lots of course Web sites that happen to be up and maintained by faculty members and an institutional publication process, that has committed to that as a permanent activity.
People also mentioned DSpace, which is the sister project of OpenCourseWare for research. DSpace is a pre-publication archive for MIT's research. Again, we just heard Rick talk about pre-publication archives; there are lots and lots of them. The difference with DSpace is that there is an institutional commitment by the MIT libraries, justified in terms of MIT's mission, to maintain that.
OpenCourseWare would make sense if only MIT did it, but DSpace can't possibly be like that, so DSpace, in addition to being a pre-print archive for MIT, is meant to be a federation, which collects together the intellectual output of the world's leading researchers. We have six partners who are working with us.
Again, the importance here is not that there is a piece of software and some database that does it, although I should put in a plug for that. There is a very good pre-print server system that is very robust, it is available on the Web as open source and has already been downloaded by about 2500 places, but the important part is, there is a group of universities who are working out the management and sustainability processes in terms of their institutional commitments for how you would set up a federated archive like this. It is getting a lot of press.
Now, both OpenCourseWare and DSpace are ways that MIT and other universities are asking what should be their institutional role. Do they have an institutional role to play and how can they play it in disseminating and preserving their research output. The question is, why? Why are these questions coming up now? Why might universities start wanting to play institutional roles in the publication process, other than as places where authors happen to be?
The answer is that the increasing tendency to proprietize knowledge, to view the output of research as intellectual property is hostile to traditional academic values. What are some of the challenges that universities see?
I'm not going to talk about cost because that has already been talked about. Most people here know a lot more about it than I do. But I want to come back and repeat some of the things that have been said yesterday about the arbitrary inconsistent rules that universities are supposed to deal with, the impediments to developing new tools for research, and the risk for monopoly ownership.
So let's review -- Jane Ginsburg and Ann Okerson already did this yesterday -- the basic deal as seen by universities. The basic deal is the authors, the scientists, give their property away to the journals. The journals now own this property and all rights to it forever. Lifetime of author plus 70 years is forever in science. If that regime had been in place 100 years ago, we today would be looking forward to the opportunity in 2007 to get open access to Rutherford's writings on his discovery of the atomic nucleus. This is forever.
Then what happens is, the publishers now take their property and magnanimously grant back to the authors some limited rights that are determined arbitrarily totally at the discretion of the publisher. The universities, who might think they had something to do with this, generally get no specific rights at all, and the public is not in this discussion entirely.
So Jane Ginsburg has already yesterday showed us some nice examples, but let me just put them up again. If I give my property to Elsevier, they magnanimously grant me the right to make photocopies of my own articles, or to present my paper at a conference. Thank you, Elsevier. But I shouldn't beat on the commercial publishers.
Yesterday we heard from the American Chemical Society, which were I a chemist would magnanimously grant me the right to give my paper to not more than 50 colleagues -- I suppose chemists aren't as gregarious as computer scientists or something -- and to post not the text of the paper, but the title and abstract on my own Web site.
But of course, both Elsevier and the ACS are amateurs in this game when you compare them to a place like the New England Journal of Medicine, which we heard from yesterday, which grants to authors exactly no rights, bounded only by fair use law in the United States.
Ken Anderson, when I pointed this out to him yesterday, told me that is merely the journal's policy, not its practice. And they are changing their policy, which is something I applaud. Unfortunately, another part of both their policy and practice is the Ingelfinger rule, which says they will not accept any paper that has already appeared in a pre-print archive.
Now, why are universities supposed to accept this deal? Well, because publishing is a serious business. This is a quote from the Nature debate that I absolutely love. Notice, the process either is a stewardship of the journals or unknown individuals. Isn't that great? Unknown individuals. And copyright should not be ceded to individual authors. Where did that copyright come from in the first place? This is an amazing statement.
Why are we doing this? We are promoting the progress of science, and surely quality publication and integrity is important for promoting the progress of science. But there are lots of other things that we could use. There are lots of other things that go into promoting the progress of science.
Paul Resnick already mentioned Google as a research tool. If I would like to know about the HOX-8 gene in zebrafish, I can go to Google and I can type that in, and I can get all sorts of references, unfortunately, not the good ones. Those are locked up behind electronic walls.
I was very encouraged to hear Gordon Tibbitts from Blackwell tell me this morning that they are thinking about a way to allow Google to index peer reviewed literature. But it is not only indexing.
Here is a great little research tool that you can get on the Web. Someone went and made a system that does a concordance of any book in Project Gutenberg. I did some research on Turn of the Screw, which you remember from high school is this wonderful book about evil by Henry James. You can type in the word evil and say, show me -- isn't that amazing? The word evil appears only seven times in Turn of the Screw. If I go to any one of those, I can see the context of it. It is a marvelous research tool. In high school, term paper heaven. I can get a concordance of Project Gutenberg; I cannot get a concordance of the Communications of the ACM. This is easy technology.
The important part is that this technology was made by somebody who just did it using public tools. It is easy to make a concordance. What is hard is to get across the electronic fences if these things are not available with open access.
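To show how easy the technology is, here is a minimal concordance sketch in Python using only the standard library. It is not the tool on the Web referred to above; the function name, the context width, and the sample sentence are made up for illustration. Given any openly available text, it finds each whole-word occurrence of a term and returns it with some surrounding context.

```python
import re


def concordance(text, word, width=30):
    """Return each whole-word, case-insensitive occurrence of `word`
    together with up to `width` characters of context on either side."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        # Flatten line breaks so each hit prints on a single line.
        hits.append(text[start:end].replace("\n", " "))
    return hits


# A made-up sample; in practice `text` would be a full public-domain book.
SAMPLE_TEXT = "The evil that men do. Nothing Evil here,\nbut evildoers abound."

if __name__ == "__main__":
    hits = concordance(SAMPLE_TEXT, "evil")
    print(len(hits), "occurrences")
    for hit in hits:
        print("...", hit, "...")
```

The code is the trivial part. What it needs, and what it cannot supply for itself, is access to the full text.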
So the question that I want to ask is, are these tools going to be stillborn because everything is hidden behind fences? Probably not, because the stuff is valuable, publishers know it, people are going to invest in it. The more serious outcome is that the spread of these tools will be done in such a way to stimulate network effects that will further concentrate and monopolize ownership of the scientific literature.
You heard about it yesterday, right? If I make a search engine that talks to only the publications from one publisher, that becomes valuable enough for the publisher then to come in and do what you librarians were talking about yesterday, the big deal. This cycle goes on and on and on.
In case you think I am being paranoid, here is one publisher's view, which you might want to read. We will give scientists desktop access to all the information you need, made available to researchers under licenses according to their institutes. If I were less charitable, I could characterize this as megalomania, but in fact, it is a modest little statement made by a modest little company that in 2001 got ten billion dollars of revenues, of which $3.5 billion was profit.
So the question is, are we headed for a place where the scientific literature is restricted with monopoly ownership, or a system that participates through open standards? Gordon Tibbitts talked yesterday -- and I absolutely applaud this -- about Blackwell's support or recognition for the need for open standards. Let me just mention that one impediment to openness is copyright. You have already heard about Creative Commons. One of the issues about copyright which the panel yesterday didn't mention is the default. If you do nothing when you put something on the Web, that is copyrighted with all rights reserved to you. The assumption that anyone has to make in coming in to that is, they can't use it.
It turns out to be surprisingly difficult to -- in the wonderful legal phrase -- abandon your work to the public domain. That is hard to do. It is even more difficult to specify some rights reserved, not all rights reserved.
This is what Creative Commons is about. We started this as an effort to encourage people to allow controlled sharing of their work and to effectively brand this kind of "some rights reserved" on the Internet. There is currently a tiny percentage of the Internet which is using this, something like 250,000 Web pages today, but hopefully that will grow.
Pat Brown showed an example of a Creative Commons -- this is an example of a Creative Commons agreement written in a language for human beings. We also have the same thing in the language written for lawyers. And more importantly, the same thing written in a language for computers, so that a search engine can come around and say, can I make derivative works on this thing.
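As a rough illustration of the machine-readable layer, here is a sketch in Python of how a crawler might answer the derivative-works question for a Web page. It rests on two assumptions that are not taken from the talk: that the page marks its license with a link carrying rel="license", and that the license address follows the Creative Commons pattern /licenses/<code>/<version>/, where a code containing "nd" means no derivatives. The class and function names are hypothetical, and this is not the RDF description that Creative Commons itself publishes.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LicenseLinkFinder(HTMLParser):
    """Collect the href of every <a> or <link> element marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").split()
        if tag in ("a", "link") and "license" in rel_values and attributes.get("href"):
            self.license_urls.append(attributes["href"])


def allows_derivatives(license_url):
    """Return True or False for a Creative Commons style license URL,
    or None when the URL does not follow the assumed /licenses/<code>/ pattern."""
    parts = urlparse(license_url).path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "licenses":
        return None
    terms = parts[1].split("-")  # e.g. "by-nc-nd" -> ["by", "nc", "nd"]
    return "nd" not in terms


# A made-up page fragment carrying a license link.
PAGE = '<p><a rel="license" href="https://creativecommons.org/licenses/by-nd/4.0/">Some rights reserved</a></p>'

if __name__ == "__main__":
    finder = LicenseLinkFinder()
    finder.feed(PAGE)
    for url in finder.license_urls:
        print(url, "derivatives allowed:", allows_derivatives(url))
```

With no license statement at all, the function has nothing to go on, which is exactly the default problem described above: silence means all rights reserved.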
Let me just finish, because the hook is coming out, and say, the world is disaggregating. There is a big game of dis-intermediation going on. The place to look for action is not new technology. This is old technology. The place to look for the action is new institutional players coming into the game. New technologies might do that for universities. I hope that happens, and I have faith that if that does, that will lead to the promotion of the progress of science.
Thank you.
DR. ATKINS: Thanks to all our speakers. We will open up the floor now for discussion. Before I do that, I want to take not more than five minutes and offer up an opportunity to any of the panelists to add comment or question one of their other panel members. Anyone have something they would like to add at this point?
Let's bring on the questions from the floor and from Web cast-land. Let's begin here on my left. Please state your name.
DR. BERRY: Steve Berry from the University of Chicago. There is one aspect of the refereeing process, Paul Resnick, that I think you have dismissed, overlooked, that could be done in other ways. Reviews by and large have a lot of influence on what actually gets published. It is not a simple rejection process. Of the papers that are reviewed and published, a very high percentage are revised as a result of the reviews.
Furthermore, we have to recognize that the reviewing process provides only a very low threshold. It simply says that this material is of a quality that it is worth scientific discourse. So it is not a judgment of whether it is right or wrong. It is only as I say low threshold.
Now, these could be done in other ways, of course. But I think that we have to recognize that in some fields, reviewing even at that low threshold is looked upon as a very important protection.
I have listened to arguments between physical and biological scientists. Physical scientists are as Rick Luce points out very ready to accept the archive model without review first, and online reader review. The biological scientists are worried that without that low threshold review, things could be put online that would be dangerous to lay users. They feel that there is a large audience for biological and especially biomedical articles that simply don't have the judgment that the professionals have, and that it is simply dangerous.
I'm not sure whether I buy the argument, but it is certainly one that one encounters.
DR. ATKINS: Anyone like to pick up on that?
DR. RESNICK: The first thing is, I just want to make sure I emphasize this. I am viewing these alternative review mechanisms as "and peer review," not "or peer review."
I do think that maybe the peer review -- we could get more out of it than we are getting now, even just the existing peer review, if you made the reviews public. For example, you don't want to let something get out there for the general public to see without anybody from the scientific community saying this is rubbish. Fine, put it out there with the two scientists who said this is rubbish. Why is it better to just hide it than to put it out there with the commentary from the scientific experts?
DR. ATKINS: Any other panelist comments? Let me take a suggestion/question from the Web. Referring to the previous talk, and I'm not sure which one precisely that is, might not the thread of the original version of a paper, along with reviewers' comments and authors' revisions or responses to the comments, defense, results, methods, et cetera, as well as the journal editor's additional comments and interpretations, be used as educational material/enrichment for university students? This could be done anonymously.
So it is kind of a suggestion idea. Anyone like to comment on that?
DR. RESNICK: I think it is a great idea. Clearly one of the problems for PhD students now is that they don't ever get to see a paper through its whole process until they do it themselves and learn it the hard way. So an advisor can show them, here are the reviews that have happened to my papers, but not all advisors do that. I think having it a more public process would be helpful for education.
DR. ATKINS: Rick Luce, did you have something to add?
DR. LUCE: The question always comes up, what is the correct balance between noise and some real value in that discussion, and should the system filter it out or should you let the user filter it out. It is debatable as to where you draw the line there.
DR. ATKINS: On my right.
DR. CAMPION: Ted Campion, New England Journal of Medicine. First, just point of fact. As was alluded to, our practice is that all scientific reports are public, free to anybody that has Internet access, and it is all indexed by Google now. We have been getting that done with the help of HighWire.
But my question concerns your discussion of new players in scientific publishing. I think it seems that one big player that we are seeing has barely been mentioned. That is the public press. Scientific publication, particularly in biomedicine, is largely being judged now, at least by authors and I think even academe, by how much press coverage it gets. It is not just studies of estrogen. Zebrafish and hedgehog mutations are getting into the New York Times, are getting into the press.
One way we are being judged now by authors isn't just boring citation indices, but did Peter Jennings cover it. What effect do you see -- and this of course is all being driven by the public's need and increasingly sophisticated view and interest in science and biomedical sciences in particular, but quite broadly, and what effect do you see this having on scientific and biomedical publication?
DR. ATKINS: Hal Abelson, do you want to start?
DR. ABELSON: I am kind of reminded -- I had a discussion with my daughter, who is in medical school, about MIT's OpenCourseWare, and she is saying that would be a terrible thing for medical schools to do, to put their course curriculum on the Web, simply because you had to be a professional medical student in order to evaluate that, and it would be dangerous to have that information out there.
I don't quite know what you do about sensationalism. I think it has gone through so many aspects of our society, and with many -- you could do a lot worse than have Peter Jennings talk about something in the New England Journal. It has been a tradition in the United States that the cure for speech is more speech. Maybe if there are other channels for people to respond, things would be better. But it is very hard to imagine a path that says we should restrict that in some way.
I wanted to ask you, by the way, is the New England Journal thinking of revisiting the Ingelfinger rule?
DR. CAMPION: We publish scientific reports. If we judge that something has been published before, we don't republish it.
DR. ABELSON: So maybe part of the answer is that it is up to the press to worry about the novelty and up to the journals to worry about the authenticity.
DR. ATKINS: Any other comments from the panel on this question? Thank you. The back microphone, on this side.
DR. DOYLE: Mark Doyle from the American Physical Society. As one of Paul Ginsparg's first employees through Rick Luce, I guess I am one of the few passionate people that Rick mentions.
Now I work for the APS. Like MIT, part of our mission is explicitly the advancement and diffusion of the knowledge of physics. So during yesterday's panel, I was surprised when the question was asked of the first panel, what do publishers do that they think they excel at.
Everybody on the panel just said peer review. They didn't really focus on the other big thing that we at the APS think is very important, which is the responsibility to do the archiving and things like that. We have already gone through pretty much the transition, going to fully electronic. The core of our archive now that we consider the primary output is a very richly marked up electronic file from which print and the online material is all derived. That is the thing that we would like to be able to curate.
What I don't see in DSpace or in the e-print archive yet is any effort to build that infrastructure for curating that kind of material. It is one of the most important things. The simple question is, here at the APS we are a well-intentioned scholarly society; why don't we just make our journals an overlay on arXiv or something like that. Effectively, in some ways they are.
I think there are two things. One is, peer review we still think is extremely valuable, and the other is this archiving thing. We seem to be one of the few places that is actually able to curate these things, we being the publishers. An exception would be NLM, which I think we will hear about later, where they have taken this approach to building archives of XML and things like that.
So I was just wondering if people in the panel could comment on that aspect of doing the preservation.
One other thing I would like to comment on is the low cost of arxiv.org. The key problem here is that there is a two to three order of magnitude difference between what it costs the APS or another publisher, say $1500 per article, and what it costs to do dissemination in the archive. That is really where all the tension in the economic models comes from, when you have a difference of that many orders of magnitude. That is what puts all the pressure on people to change the way that things happen.
That's it.
DR. LUCE: Two threads. The first thread is the archiving question. It is an interesting question. I don't think that libraries generically are going to be able to pull it off. I do think that there are probably a dozen to two dozen libraries globally that would see that as an important role, a significant thing to make an investment in, because they are thinking about centuries, not years or decades. That will move forward.
There are some publishers who are quite aware of the problem, as you are, and making investments. There is a vast majority numerically of publishers who are just simply too small to have the wherewithal, both technologically and financially, to be able to pull it off.
So it is going to take some sort of a hybrid relationship between some publishers and some libraries, a small set of libraries, who see that as their role.
Your second question was I believe related to cost?
DR. DOYLE: It wasn't so much a question as a statement, the fact that there is a two to three order of magnitude difference in the models. But those costs need to be recovered if we are to preserve the two important things that the APS does, which is the peer review component and the archiving component.
Actually, I do have a way to turn that into a question. For that $1500 per article, would libraries be able to actually -- since the large part of that $1500 per article is going into producing that archival XML file, would libraries be able to then take that -- we would be able to give back to the libraries the thing that they paid for, but would they be able to curate it correctly, and is that an incentive for institutions to pay that $1500 per article, rather than authors, if it could feed more richly structured data into their institutional repositories.
DR. LUCE: I haven't looked at how that would scale. The issue that I wanted to respond to related to cost -- and I think Mike Keller touched on this yesterday in his talk -- some very small number of publishers have actually started to rethink and make progress in terms of how production occurs. Therein lies some cost-saving opportunities.
The large majority of publishers that I am familiar with are essentially taking a paper model, electrifying it, it is an analog, and so they are saying we have these enormous costs, without really looking at how do you need to produce this in a different environment. Until they make that flip, it is very difficult to talk about what costs actually ought to be as opposed to what they are.
DR. ATKINS: Hal Abelson, do you have anything to add?
DR. ABELSON: Yes. When we designed the DSpace, it was completely, absolutely essentially deliberate that this was housed in the MIT libraries. The reason is that whether or not the MIT libraries will preserve something for 200 years, they sure as hell will preserve it for 50 years. We wanted to work with an organization that understood what archiving meant, and also what curation meant, because that is what libraries do.
So just to build on what I was saying earlier, the critical thing is not the technology that one can put up an archive someplace, because that technology is well under control. The critical thing is to find an institution that will say, as part of our core mission, we are dedicated to keeping this around and preserving it.
If it is part of the core mission of the ACS to be the repository of all chemical literature and have every other organization in the world be your franchisee, that is an important thing to say.
The other thing I wanted to say. I completely agree with you, it is very important not to get trapped into the idea that you are going to be building the monolithic, end-to-end solution. So DSpace will never be that. DSpace might be a place where people building peer review systems and curating systems and all kinds of other systems can link to and build on.
But the trap, and this was alluded to yesterday by Blackwell, is that you don't want to be in a position where people build the whole system. You want to be in a position where there are elements that are communicating through open standards, so lots of people can come in at different places in the value chain and add value in different ways.
DR. ATKINS: We'll go to the side microphone.
DR. MC HENRY: Bruce McHenry. Bruce A. McHenry if you are Googling me. You will find my Web site at discussit.org. Right above the reference to Discussit, you'll find a reference from one of the LCS lab computers that says, Bruce A. McHenry, I am a schizophrenic foosball. If you have been at foosball at MIT, I guess you have a right to be schizophrenic.
Basically, I am working to found a company and raise several million dollars over the next few months to build a layer of protocols on the Web which will allow for annotations associated with credits and debits. From those credits and debits will flow reputation information, which will help to promote or demote pieces.
I think it is going to be part of basic Web infrastructure and an operating system with network effects and lock-in potential, that needs to have a significant discussion about how much of it should be publicly funded and how much should be privately funded.
But initial markets are not going to be academic publishing, so the New England Journal can rest assured that they won't be invading your space right away. The initial markets will be areas where people are paid a lot of money to do what they are doing, which is legal, investment banking, consulting, writing software, and eventually it will trickle down into academe. But academe will be at the leading edge of the rise of the S curve, and so that is why I am here, and that is why I stayed at MIT for graduate school.
I want to pick up on a thread that Hal started with the slide about changing the basic deal. In academe, the basic deal is, you get to have the keys to the universe, know the secrets of how the world works. In exchange for that, you take a vow of poverty, often.
That basic deal does not play very well to mothers around the world. Mine is asking me every time I talk to her, have you made any money yet? It doesn't play particularly well to adolescent girls, where the choice is between Britney Spears and maybe learning science.
So changing the basic deal could start at the top. Paul Resnick mentioned teaching, research and service as the three missions of faculty. I'm not so sure about the way the service mission is being performed, and I'm not so sure about the way publications are being run by boards of peers, selecting articles for submission, for publication.
I think actually, the process of selection is probably inverted, in a sense, because the articles which are of most interest to the audience, the audience knows about, not necessarily the people who are on the review panels. So the selection process should probably start with the audience and then be raised up to the experts in the field for corrections, embellishments and inclusion in the historical record.
So changing the basic deal to me in academe would look something like this. Instead of being awarded research dollars to go off and do research, or maybe in addition to having research dollars to go off and do research, you get substantial amounts of money to give away to others whose research you deem useful and interesting to you. This could apply even to students, that a significant portion of their tuition be given to them as money to be used in the system to buy the work of their peers and also the faculty and other experts in the society.
That changes the whole model to one which is much more monetarily driven, probably. I know there is a great deal of resentment towards any kind of suggestion of that among academics. However, one only has to look at the example of the Soviet Union to see that monetizing things depoliticizes the process of creation, and greatly improves the quality and quantity of the content.
DR. ATKINS: So is there a question there? Paul Resnick has a comment on that comment.
DR. RESNICK: Just the fact that you are starting the company makes me want to ask a question about other institutional players who might be coming in besides universities. Do you see anything else on the horizon? Are there more Googles? Should we be expecting General Motors in this realm? Is it only universities that are the new player here?
DR. ABELSON: I think from the point of view of libraries, the interesting shift is happening at universities. Traditionally, the role of a library at a place like MIT has been, bring in all that stuff from outside to support the research going on in the institution. The interesting shift in something like DSpace is, the library is saying maybe we have another mission or a slightly different mission, which is to effectively be a place where the research from the institution is disseminated.
That is kind of the message that is going on in this DSpace federation, of universities playing with the idea of, does it make sense to view their mission in a very different way.
Now, I haven't actually seen -- I have talked to people in research libraries from large companies, and I haven't seen that as a theme that people adopt, although on the other hand, you can imagine a library at a place like IBM looking at itself effectively as a piece of the marketing arm of a company. If you go to IBM or Cisco, there are tremendous resources that you can get there, but they are not quite seen as a link with the research libraries of that institution.
So I can imagine that sort of thing happening, but again, the point is, there is lots of room for many, many different players to start looking, as I said, not at the whole thing, but pieces of the thing. You can imagine a certification place, you can imagine a place that does professional peer review. These opportunities start coming up.
DR. ATKINS: We are going to start here and then we'll go around.
DR. KING: Thank you, Dan. Donald King from the University of Pittsburgh. Looking at the amount of use, the types of use of articles that are written in science and medicine, about two percent of those articles are used for citation purposes. About 25 percent of those articles are read by academicians, and about 75 percent of those articles are read outside of the academic community.
What I am suggesting is that when you begin these feedback systems, that you think in terms of the enormous value that is derived outside of the academy. There are two purposes for doing this. One is that it is a better metric. It will achieve a better metric for assessing journals and authors and all that kind of stuff, but it also will begin to develop a means of the authors recognizing that their audience is outside of their peers.
I have done a lot of focus group interviews and in-depth interviews of faculty, and they seem to think that they are writing only to the people they know, their immediate community. I think there would be some value in the system if there was some kind of an acknowledgement or recognition that there are other uses of that information outside of the Academy.
Thank you.
DR. ATKINS: Any comments? Go ahead, Paul Resnick.
DR. RESNICK: I think that is good. I think you have a chance to start getting some of this feedback and also usage data from outside the academy.
As you point out, feeding that back to the individual authors, it is not just that you want other people to be able to evaluate whether my stuff gets read; there is actually very little feedback for authors about what is happening with their works.
DR. RHINE: I'm Lennie Rhine, University of Florida. I have no project to promote or position to defend, so I just have a question. How do universities in the tenure process adapt to the open source environment?
DR. ABELSON: Could you say a bit more? I'm not sure what you mean by that.
DR. RHINE: Well, within academia, I think that it is driven a lot by the tenure process. I think that most academics hold on to this as a mechanism to rank hierarchically, to evaluate, et cetera. So if you are going outside this vehicle of peer review journals which is used so heavily in the tenure process, how do you incorporate this more ephemeral open source information into that process? Does that help?
DR. RESNICK: I'll just tell you what Hal told me before the panel, which is an idea for collecting a lot of these metrics. Right now in the tenure process, they get the journal rankings and they count the number of journals, and different departments weight it differently. They may or may not construct a numeric score, but they all have it in their heads.
You might actually have an open system for computing metrics like that. Think of how U.S. News & World Report does their rankings of schools; they have a particular way of weighting everything. But now, imagine a more open version where we collect all the data, we know what things have been cited, we know what things have been read. We have all the reviews, we have both the behavioral and the subjective feedback. Then you let the teaching institution that has its certain tenure criteria create its own metric based on that, and the research institution will create a different metric, and you can have lots of different ways of using that data.
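The open metrics idea described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the indicator names (`citations`, `reads`, `review_score`), the sample records, and both sets of weights are hypothetical and do not come from the discussion. The point is simply that one shared pool of data can feed different institution-specific rankings.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """Openly collected indicators for one article (all fields hypothetical)."""
    title: str
    citations: int       # behavioral: how often it was cited
    reads: int           # behavioral: how often it was read
    review_score: float  # subjective: mean reviewer rating, 0-5


def score(record, weights):
    """Weighted sum of indicators; each institution supplies its own weights."""
    return (weights["citations"] * record.citations
            + weights["reads"] * record.reads
            + weights["review_score"] * record.review_score)


def rank(records, weights):
    """Return titles ordered from highest to lowest score under the given weights."""
    ordered = sorted(records, key=lambda r: score(r, weights), reverse=True)
    return [r.title for r in ordered]


# Hypothetical shared data pool.
records = [
    ArticleRecord("Paper A", citations=40, reads=200, review_score=3.0),
    ArticleRecord("Paper B", citations=5, reads=3000, review_score=4.5),
]

# Two institutions, same data, different criteria (weights are made up).
research_weights = {"citations": 1.0, "reads": 0.001, "review_score": 1.0}
teaching_weights = {"citations": 0.1, "reads": 0.01, "review_score": 2.0}

research_ranking = rank(records, research_weights)
teaching_ranking = rank(records, teaching_weights)
print(research_ranking)
print(teaching_ranking)
```

With these invented numbers the two institutions rank the same two papers in opposite order, which is the behavior the open system is meant to allow.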
DR. ABELSON: I think specifically with respect to the tenure process, one of the marvelous things that the NSF did several years ago was to limit the number of papers that one can cite in preparing a grant proposal. It is not really a question of quantity or vast numbers of papers published in third-rate journals.
One of the things that we have been trying to do at MIT is effectively to point out when you come up for a tenure case what are the three things on which I should be evaluated, and to try and get this whole enterprise moving more in terms of what Alan Kay used to call the metric of Sistine Chapel ceilings per century, rather than papers per week.
DR. MOLHOLT: Pat Molholt, Columbia University. I just wanted to offer an example that I think is a hybrid of a hybrid, putting together some ideas from Abelson and Luce, that is active at Columbia at this point.
There is something called CIAO, Columbia International Affairs Online. It is from within the libraries. It is a publishing arm that acts collaboratively with peer institutions to assemble material in sort of a pre-print way. It is technical reports, reports of conferences, other gray literature.
It is then packaged along with other aspects, some news reporting and some more ephemeral material, and it is packaged back out for sale to libraries. So it comes from the library, it is packaged out and sold to libraries, but it also has an element within it that is freely accessible for high school-junior high school teaching that is public and open and can be accessed by anyone. So it picks up on pieces of a number of areas.
That is in international affairs. They are doing one now also in earth sciences, and we are contemplating one in alternative medicine.
DR. ATKINS: Thank you. We'll go to the back microphone, then the front.
DR. BLUME: Marty Blume from APS. First, I'd like to make a cautionary remark about metrics. In fact, I think these have been addressed by Paul Resnick.
I was happy to hear Rick Luce use the word indicators rather than metrics, because there is no one number that can be used as a measure of quality. There are things wrong with all of them, and there was an excellent list of gaming and the like that was put up there.
There is a very nice Dilbert cartoon, which I won't show for fear of violating their copyright, but it does show the human resources manager saying metrics are very important, a very good one is the rate of employee turnover, and the reply by the manager is, we don't have any turnover, we only hire people who couldn't possibly get work anywhere else.
Many of the metrics suffer from this, and they can be manipulated. You really have to inquire into them and use them as indicators, and it takes a fair number of them if you are going to get a fair measure of quality.
I wanted to also comment on peer review. One of the things that one has to worry about in the case of public comment, although I approve of that, I think it is a good thing to do, nevertheless there is a sort of Gresham's Law of refereeing, in that the bad referees tend to drive out the good ones. All of us who take part in listservs of one sort or another know of the loudmouth who will not be contradicted or denied, and eventually the rest of us give up and say we are not going to take part in this anymore.
So you do have to expect something like this, and you have to have a degree of moderation in it. Lo and behold, what is that but an editor. So this is another piece of the peer review process.
Also, the knowledge that a paper is going to be peer reviewed does have an effect on an author. It means that they try to improve it at first so that it will pass this barrier.
I do have some statistics on peer reviews from our journals. We generally look -- this is a metric; we look at the first 100 articles submitted and track them through in a year for one of the journals, and track them through to see what has happened to them in the end. You can imagine these having been put up -- all of them having been put up on the e-print server, and then follow them in that way.
But of the first 100 submitted in one of the years, 61 were accepted and the remaining 39 were rejected or recommended for publication elsewhere. Of the 61 that were accepted, 14 were approved without any revision after one report, 22 were approved after resubmission after some modifications, 14 after a second review and four after the third resubmission, all of these leading to improvements. Some wags say that the improvements for some of them are largely adding references to the referees' papers, but even that is an improvement if there is not enough citation of other work, something that we see now. Of the rejects, 19 were rejected after one report, 13 after two, four after three, two after four and one after six reports. It is much more costly and difficult to reject a paper than to accept it.
But this gives an idea of some of the things and some of the value added in the course of traditional peer review which I know was not under attack in this case. It is something that has to co-exist with the other pieces.
We tried to avoid using referees of the type that would lead to Gresham's Law that we see on the listserv. We are aware of them, and try to select accordingly.
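The figures quoted above can be tallied in a few lines. This is a minimal sketch: the variable names are illustrative, and only the counts come from the remarks. The itemized acceptance stages as transcribed add up to 54 of the 61 accepted papers, so the sketch reports that remainder instead of guessing where it belongs.

```python
# Counts as stated in the transcript for the first 100 submissions.
submitted = 100
accepted_total = 61
rejected_total = 39

# Accepted papers by stage: no revision, after modifications,
# after a second review, after a third resubmission.
accepted_by_stage = [14, 22, 14, 4]

# Rejected papers keyed by the number of referee reports before rejection.
rejected_by_reports = {1: 19, 2: 13, 3: 4, 4: 2, 6: 1}

accepted_itemized = sum(accepted_by_stage)
accepted_unitemized = accepted_total - accepted_itemized
rejected_itemized = sum(rejected_by_reports.values())

# Total referee reports spent on papers that were ultimately rejected.
total_reject_reports = sum(n * count for n, count in rejected_by_reports.items())
mean_reports_per_reject = total_reject_reports / rejected_total

print(accepted_itemized, accepted_unitemized)
print(rejected_itemized)
print(total_reject_reports, round(mean_reports_per_reject, 2))
```

On these numbers the 39 rejections consumed 71 referee reports, a little under two per paper, which is consistent with the remark that rejecting a paper is costly.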
DR. ATKINS: Thank you.
DR. RESNICK: I have one response and then I have one question back to you. I'll ask the question first. Do you know what happened to the 39 that were rejected? Where did they go after that?
DR. BLUME: No, we don't know. We do have some measure of this. In our letters journal, Physical Review Letters, we have found that the acceptance rate is 35 percent. Many of the papers are rejected. It is not an indication of low quality, because there are other criteria, including the breadth of interest of the article, which is rather more subjective. But we have seen in another letters journal -- one of our editors has gone through this, looked at the titles, and found that a full 40 percent of the articles in that journal in the first six months of the year had been rejected from Physical Review Letters. Otherwise we don't track them, and we can't tell.
DR. RESNICK: One of the reasons I ask about that is, just as there is a confirmation bias that we only publish experimental results where the hypothesis was confirmed, we also only let people know when a paper has been accepted; we don't let people know when it has been rejected.
I do want to comment on the Gresham's Law of reviewers. One way that you can evaluate the reviewers is to have an editor or some person in charge who picks them or moderates. But you could come up with some systems where you calibrate reviewers against each other.
DR. BLUME: We actually do this. We keep a private file based on the reports that we receive. The individual will give us a detailed report, a reasoned report. It is something which says publish or reject, which we largely ignore, that sort of thing. And unfortunately, this leads to an over-burdening of the good reviewers, so you are punished for the good work that you do.
DR. RESNICK: No good deed goes unpunished. I would just point out that that kind of system that you are using privately and internally, you could imagine some version of that in more public systems as well. So if you go to a more public system, you don't necessarily have to have all of the lousy reviewers get equal voice.
DR. BLUME: If we do that, we would certainly want to pick people, and would probably continue to do it anonymously. A reviewer is always free to reveal that that individual is the reviewer. We will not.
DR. ATKINS: Front microphone here.
DR. FRIEND: Fred Friend, Joint Information Systems Committee in the U.K. I'd like to ask Hal Abelson how he sees the future relationship, long term relationship, between institutional repositories and traditional publishers. I can see that repositories have a very valuable role in shaking up the system and in helping us to establish or go back to better priorities in scholarly publication, but what is the role for traditional publishers in the long term in that situation?
Hopefully publishers will respond in a positive way to these changes, and may come out in the end with a better role than they have at the moment, or could institutional repositories take the place of traditional publishers completely?
DR. ABELSON: I think it was Yogi Berra who said it is really hard to make predictions, especially about the future. But I think the main point is that as you have new players, you have different kinds of roles. I don't see any inherent hostility between institutional archives and traditional publishers.
MIT for example has a very respectable journals operation in the MIT Press, and we are looking for ways to find joint projects between the press and the DSpace archives. I think what would be very interesting is to see some kind of structural track from pre-prints through certification through authorization process and be able to see some larger piece of that whole process that would be done by a bunch of different institutions coming in collaboratively.
You can imagine just off the top of my head a university holding both the pre-print through the edited version, the journals coming in with some kind of authentication and review cycle. I don't know, I just think there are lots and lots of opportunities. The trick is to free up the system and allow other players in who will provide pieces of that process that the journals for various reasons haven't been providing.
The danger as I said is, some individual player wants to come in and lock it up in some complete system and says, I own the whole thing. The problem with the World Wide Web has always been that everybody wants to be the spider at the middle of it, and that is the thing that we have to resist.
DR. ATKINS: I think there is a suggestion of an answer to your question in one of Rick Luce's slides, where he was showing that the pre-print servers or the e-print servers or these repositories at the lower layer were providing a platform or an infrastructure on which a whole host of yet to be fully imagined value added entities could be built, some of them for-profit entities.
So the idea is to create a more open environment for the kind of primary or upstream parts of the value chain, and then to encourage kind of an economy of activity on top of that.
DR. FRIEND: It is just that traditionally, we have looked on publication as being the record. Yet we seem to be saying that long term archiving is not for publishers. So that perhaps rules the record function out for traditional publishers. So what are they left with?
DR. ATKINS: I won't answer that question, but I will just comment that one of the themes that came out loud and clear in the cyberinfrastructure study was the huge latent demand for serious curation of scientific data and mechanisms for the credentialing and validating of scientific data and for mechanisms for encouraging interoperability between data in different fields as people create more comprehensive computational models of ecologies and environments and so forth.
So one of the recommendations in this report is for NSF to step up to the provisioning of leadership in that arena, and of course, at some level there is a lot of synergy between that and some of the long term institutions. In fact, the volume of bits that for example just the high-energy physics community generates probably exceeds even in a year most of the scientific literature worth keeping. So there is a huge scale there.
DR. LUCE: Just to make a quick follow-on point, certainly representing the governmental sector, institutional repositories are also an issue of deep concern, in terms of being able to have access to things created with public monies, publicly available.
One can easily imagine a system again that takes into account, if that is at the bottom level, things like usage activity, annotation systems, where there are lots of annotations going on. That would suggest that here are some nuggets that perhaps publishers in a different sense at a different stratum might look at, and start to mine that for opportunities to more formalize or put a different value-added spin on that kind of information.
So I don't think these things have to be competitive. In fact, they can exist in a way that is very, very complementary.
DR. ATKINS: Thank you. That was Rick Luce. We'll go to the back here.
DR. KRELLENSTEIN: I'm Mark Krellenstein from Elsevier. I thought people might want a live target up here that they could talk to, so I am offering myself.
First a correction. The profit numbers, revenue numbers, that you quoted weren't correct. Elsevier certainly is a large successful company that is very profitable, but the actual revenue number is a little bit less than the number you talked about. Elsevier's part of that is less than a quarter.
Secondly, you talked about Derk Haank's comment about producing a universal search engine for licensed and proprietary content. That really came in response to our users and to the libraries who have told us what their users want. What we hear is that people want the answer to Google for licensed content. They really want as close to a universal search engine for proprietary materials as they have with Google for the non-proprietary.
The idea of multiple players in the value chain, which I think appeals to expert researchers such as me or to you, is less appealing often to undergraduates in particular, who really want to the extent that it is possible a single solution for all their search needs. We don't expect it to be really a single solution, but as much licensed content, ours and other publishers' also, that it is really an open platform that we are selling, not a licensed set of data.
As much as we can put together, we are trying to do that. No differently than Google. We are charging for it because of the model we have for charging for content. Google supports itself via advertising. That could be an option perhaps for a company like Elsevier, but probably not in the scientific space to make the kind of revenue that Google makes, or to make the kind of revenue that is necessary, given other models discussed here. That is a separate discussion, about how we pay for that.
The other point I wanted to make is that we are open also to other players doing the same kind of thing. There is a metasearch initiative going on right now in NISO, which again is trying to respond to libraries' request to have a small number of services for proprietary content, rather than that long list of 60 providers. You go to the library and you see all the electronic services that are available.
So we are working there together with NISO and other large publishers who are here to develop open standards, so that any metasearch company could come in and metasearch these proprietary services.
Finally, we offer a service called Scirus right now, which does provide access to some of that hidden content that Google does not provide access to. It is a free service, www.scirus.com. It has scientific papers from the Web. We only cull the scientific part of the Web. It has got 150 million Web documents. It has all the Elsevier proprietary content. It has other publisher proprietary content, whatever we can license from other publishers. That content is available for a fee. The abstracts are free, but if you click through to that content, if your site is licensed, you go right through to the full text, and if not, there is a pay per view model, at least in the case of Elsevier.
I guess my question for Professor Abelson is whether you don't see that need also for more consolidation, at least for undergraduates in certain kinds of research at the same time that you want this open federated capability and the ability for many, many players to come in.
There is a simplification. What Google has in fact done is, it has created success, where something equivalent is not desirable to some extent, not completely, but to some extent on the licensed content side.
DR. ATKINS: Any followup? We have ten more minutes left in this session. If there are any of you listening through Web cast that would like to send in questions, please do so quickly. Don King?
DR. KING: Thank you, Dan. Donald King again from University of Pittsburgh. I really am an introvert.
There is another dimension that I think you need to consider when you are assessing the materials that are published. That is the dimension of time following the publication of the articles, or the availability of them in the pre-print archives.
The median age of a citation is something like six to eight years, depending on the field of science. Most of the reading that takes place, about 60 percent of it, takes place within the first year of publication, but almost all of that is for the purpose of keeping up with the literature, knowing what your peers are doing, and things of this kind. Something like 40 percent of that information is already known by the reader.
As the age of the material gets older, the usefulness and the value of that material gets older. But about ten percent of the articles that are read are over 15 years old, and for the most part, particularly in industry, that is particularly useful information.
One of the reasons for that is that science and industry oftentimes reflect the needs of the organization being served. In the academy, you tend to follow a line of research over a long period of time, but in industry you are assigned a new area that has to be followed. Oftentimes that requires a need to go back into the literature much more thoroughly than you have.
All I'm saying is that there are some things that you can begin to get feedback in that will help make that better than it has been in the past.
DR. ATKINS: Thank you. This is the last call for comments or questions.
DR. SMOLENS: Michael Smolens, a company called 3 Billion Books. Just to comment, I haven't heard a lot over the last day and a half about a term that I will call cultural diversity. There are a lot of different cultures and language groups in the world that have a lot to say about a lot of the issues that are being discussed here.
I just want to make people aware of an organization that started in 1998 called the INCD. It is the International Network for Cultural Diversity. It was started by someone in Sweden who got 30,000 artists, writers, musicians together because they couldn't deal with the European Union in their own language of Swedish; they had to deal with it for patent and litigation and other issues.
Their goal is to try to have the issue of very small cultures and language groups be heard at international meetings and consortia, so that when the World Trade Organization is dealing with trade issues, someone there is at least thinking about the fact that language groupings are disappearing very rapidly and culture should be maintained.
So I just would like to point out that the cultural diversity issue around the world is a very, very sensitive issue that I think everyone needs to keep on the top of their minds when they are discussing a lot of these issues.
DR. ATKINS: Thank you for that comment. Anything else? I declare this session adjourned. Thank you very much.
(Brief recess.)
(ADMINISTRATIVE REMARKS)
Agenda Item: Panel 5: What Constitutes a Publication in the Digital Environment?
DR. LYNCH: Good morning. I am Cliff Lynch from the Coalition for Networked Information. I will be moderating this session. For those of you joining us through the Web, welcome back, whatever time zone you may be in. Let me remind those participating through the Web cast that they can submit questions which we will take later through the National Academies Web page.
This is the fifth and final topical session before our wrap-up session this afternoon. This session was designed as part of a pair with the earlier session that Dan Atkins moderated this morning, which I think almost all of you were at.
In this session, we want to start taking a look at the issue of what constitutes a publication in this digital world that is evolving, with specific attention of course to the frameworks of science and of journals.
I do want to say parenthetically that while we are going to focus on science here, there is lots of action in scholarship broadly beyond science, engineering, medicine and technology. For example, the humanities have been very active and very creative in exploring the use of the new media. I refer you among other sources to a colloquium that was jointly sponsored by the National Research Council, the Coalition for Network Cultural Heritage, my organization, and others in January of this year, to take a look at some of what the humanities are doing.
As we look at this question of what constitutes a publication and how the character of publications changes, I am hoping we can switch our focus from the environmental questions that Dan's panel was talking about, which dealt with publication as process, with the flow, with how we select, how we disseminate, to questions about how we author and how we bind together pieces of authorship into structures like journals.
If you think about it, I believe we can approach this from two kinds of perspectives. One is to take it from the individual author's point of view, to recognize that, as has been well documented, for example, in the cyberinfrastructure report you heard about earlier, the practice of science is changing. It is becoming much more data intensive. Simulations are becoming a more important part of some scientific practices. We are seeing the development of community databases that structure and disseminate knowledge alongside traditional publishing.
In that kind of a world, you can ask questions about how do people author articles, how should they author articles. It is clear that articles are or can be in the digital environment more than just paper by digital means. It is a sad fact, if you look at most of the journals on the Net today, they really are paper by digital means. In fact, they are often printed for serious reading and engagement. We still are using all of this technology around an authorship model that is strongly rooted in paper.
We can add multi-media. There are trivial extensions, but there are less trivial extensions, too. How we structure our arguments, recognizing that as Hal Abelson hinted earlier, not all of our readers are going to be human. Programs are going to read the things we write, and programs aren't very bright sometimes, and you have to write differently for them. So that is one perspective we can look at.
The other is the perspective of the journal, of the aggregation of these articles, about recognizing that the ecology in which journals exist has changed radically. It used to be that the other things in that ecology were other print journals and secondary abstract and indexing sorts of databases. Now it has become very complicated. There are all kinds of data repositories. There are live linkages among journals. There is an interweaving of data and authored argument which is becoming very complex.
So these are the kinds of questions that I hope we will have an opportunity to engage in our session this morning.
We have three speakers on our panel. Each will speak for about 15 minutes or so as we have done in the other panels. After all three of our speakers have been heard from, we will open it up to questions, and I will moderate.
Let me very briefly introduce our speakers in the order that they will be presenting. You can find longer biographies in the packet that you've got, so I will be very brief here.
Our first speaker is Monica Bradford. She is the Executive Editor of Science. Our second speaker is Alex Szalay. He is the Alumni Centennial Professor at the Department of Physics and Astronomy at Johns Hopkins. Our third and final speaker will be David Lipman. He is the Director of the National Center for Biotechnology Information at the National Institutes of Health.
I will invite Monica to come and give us the first talk.
MS. BRADFORD: Thank you, Cliff. It is a pleasure to be here this morning with you all and to tell you a little bit about Science's STKE, which stands for Signal Transduction Knowledge Environment. I am going to tell you a little bit about the history of the publication, a little bit about the current status, talk about some of the specific issues related to defining what is a publication in the digital environment, and then if time permits, talk a little bit about how we have tried to use the power of a more traditional publication that you may be aware of known as Science to help move forward this slightly less traditional project.
This was started in 1997. At the time we thought it was a bold experiment, even though we were a bunch of old players, I just learned, but we were feeling bold at the time.
It is a joint project between AAAS and Stanford University Library and at the time, also Island Press. The reason the three groups came together is, Stanford University Library was very interested in making sure that the nonprofit publishers and the smaller publishers were able to be players online as we moved into the digital environment. They also had started up HighWire Press. We had at that time put Science online, and we were excited about the possibility of working with them on new technology ideas.
Island Press is a small environmental publisher, primarily books, but they had ties to the Pew Charitable Trusts, which were interested in funding some kind of experiment online. Island Press was helping to determine what the right area might be for that experiment. They were particularly interested in the intersection of science and policy.
Then of course, AAAS -- we had been so excited about putting Science online and all the potential that we felt the online environment offered to us, that we were eager to try something new. Floyd Bloom had helped us get up Science, and it seemed like such an easy thing to do. So what was next? We could do everything, we thought.
Specifically, the goals of the knowledge environment were to move beyond the full text journal online and to provide researchers with an online environment that linked together all the different kinds of information they use, not just their journals, so that they could move more easily among them. That would decrease the time required for gathering information, giving them much more time for valuable research and thereby increasing their productivity.
Why signal transduction? That was the first area we picked. The reason was primarily that our funders wanted us to try to find an area that hopefully at some point could become self sustaining. So therefore, science at the intersection with policy was quickly eliminated, particularly because so much of the literature is actually what we refer to as gray literature. It was not digitized, it wasn't clear how you would get there, so we moved to an area where we at AAAS and Science in particular were very comfortable, and that was signal transduction.
As you can see, it is very interdisciplinary within the life sciences, biological sciences. You have cell biologists, molecular biologists, developmental biologists, neuroscientists, structural biologists, immunologists, microbiologists, all of them at some point in time need to know something. They come to a point in their research where they need to know something about signal transduction.
Also, there were some business reasons, though I am not talking about cost or revenue, the kind of things the publisher typically looks for. We thought there was a broad potential user base. Both industry and academia were very interested in this topic. It didn't have one primary journal at the time; the information was spread across a lot of journals. There was no major society. But more important, there were some things about the study and the research and the kind of information that were most important to this, and for our reasons of wanting to pursue it.
The area of signal transduction is very complex and the information is widely distributed. We felt it was important to be able to create links between these discrete pie |