An efficient journal

06 Mar 2012

...time to switch... — “You seem to believe in fairies.” Photo of the Cottingley Fairies, 1917, by Elsie Wright via Wikipedia.

Aficionados of open access should know about the Journal of Machine Learning Research (JMLR), an open-access journal in my own research field of artificial intelligence, a subfield of computer science concerned with the computational implementation and understanding of behaviors that in humans are considered intelligent. The journal became the topic of some dispute in a conversation that took place a few months ago in the comment stream of the Scholarly Kitchen blog between computer science professor Yann LeCun and scholarly journal publisher Kent Anderson, with LeCun stating that "The best publications in my field are not only open access, but completely free to the readers and to the authors." He used JMLR as the exemplar. Anderson expressed incredulity:

I’m not entirely clear how JMLR is supported, but there is financial and infrastructure support going on, most likely from MIT. The servers are not "marginal cost = 0" — as a computer scientist, you surely understand the 20-25% annual maintenance costs for computer systems (upgrades, repairs, expansion, updates). MIT is probably footing the bill for this. The journal has a 27% acceptance rate, so there is definitely a selection process going on. There is an EIC, a managing editor, and a production editor, all likely paid positions. There is a Webmaster. I think your understanding of JMLR’s financing is only slightly worse than mine — I don’t understand how it’s financed, but I know it’s financed somehow. You seem to believe in fairies.

Since I have some pretty substantial knowledge of JMLR and how it works, I thought I'd comment on the facts of the matter.

Is the pot calling the kettle black?

26 Feb 2012

Is the pot calling the kettle black? Oh sure, journal prices are going up, but so is tuition. How can universities complain about journal price hyperinflation if tuition is hyperinflating too? Why can't universities use that income stream to pay for the rising journal costs?

There are several problems with this argument, above and beyond the obvious one that two wrongs don't make a right.

First, tuition fees aren't the bulk of a university's revenue stream. So even if it were true that tuition is hyperinflating at the pace of journal prices, that wouldn't mean that university revenues were keeping pace with journal prices.

Second, a journal is a monopolistic good. If its price hyperinflates, buyers can't go elsewhere for a substitute; it's pay or do without. But a college education can be arranged for at thousands of institutions. Students and their families can and do shop around for the best bang for the buck. (Just do a search for "best college values" for the evidence.) In economists' parlance, colleges are economic substitutes. So even if it were true that tuition at a given college is hyperinflating at the pace of journal prices, individual students can adjust accordingly. As the College Board says in their report on “Trends in College Pricing 2011”:

Neither changes in average published prices nor changes in average net prices necessarily describe the circumstances facing individual students. There is considerable variation in prices across sectors and across states and regions as well as among institutions within these categories. College students in the United States have a wide variety of educational institutions from which to choose, and these come with many different price tags.

Third, a journal article is a pure information good. What you buy is the content. Pure information goods include things like novels and music CDs. They tend to have high fixed costs and low marginal costs, leading to large economies of scale. But a college education is not a pure information good. Sure, you are paying in part to acquire some particular knowledge, say, by listening to a lecture. But far more important are the interpersonal processes that a student participates in: interacting with faculty, other instructional staff, librarians, other students, in their dormitories, labs, libraries, and classrooms, and so forth. It is through the person-to-person hands-on interactions that a college education develops knowledge, skills, and character.

This aspect of college education has high marginal costs. One would not expect it to exhibit the economies of scale of a pure information good. So even if it were true that tuition is hyperinflating at the pace of journal prices, that would not take the journals off the hook; they should be able to operate with much higher economies of scale than a college by virtue of the type of good they are.^[1]

Which makes it all the more surprising that the claims about college tuition hyperinflating at the rate of journals are, as it turns out, just plain false.

Let's look at what the average Harvard College student pays for his or her education.

Switching to open access for the new year

05 Jan 2012

The journal Research in Learning Technology has switched its approach from closed to open access as of New Year's 2012. Congratulations to the Association for Learning Technology (ALT) and its Central Executive Committee for this farsighted move.

This isn't the first journal to make the switch. The Open Access Directory lists about 130 of them. In my own research field, the Association for Computational Linguistics (ACL) converted its flagship journal Computational Linguistics to OA as of 2009, and has just announced a new open-access journal Transactions of the Association for Computational Linguistics. Each such transition is a reminder of the trajectory that journal publishing ought to head.

The ALT has done lots of things right in this change. They've chosen the ideal licensing regime for papers, the Creative Commons Attribution (CC-BY) license. They've jettisoned one of the largest commercial subscription journal publishers, and gone with a small but dedicated professional open-access publisher, Co-Action Publishing. They've opened access to the journal retrospectively, so that the entire archive, back to 1993, is available from the publisher's web site.

Here's hoping that other scholarly societies are inspired by the examples of the ALT and ACL, and join the many hundreds of scholarly societies that publish their journals open access. It's time to switch.

Clarifying the Harvard policies: a response

02 Dec 2011

My friend and ex-colleague Matt Welsh has an interesting post supporting the Research Without Walls pledge, in which he talks about the Harvard open-access policies. He says:

Another way to fight back is for your home institution to require all of your work be made open. Harvard was one of the first major universities to do this. This ambitious effort, spearheaded by my colleague Stuart Shieber, required all Harvard affiliates to submit copies of their published work to the open-access Harvard DASH archive. While in theory this sounds great, there are several problems with this in practice. First, it requires individual scientists to do the legwork of securing the rights and submitting the work to the archive. This is a huge pain and most folks don't bother. Second, it requires that scientists attach a Harvard-supplied "rider" to the copyright license (e.g., from the ACM or IEEE) allowing Harvard to maintain an open-access copy in the DASH repository. Many, many publishers have pushed back on this. Harvard's response was to allow its affiliates to get an (automatic) waiver of the open-access requirement. Well, as soon as word got out that Harvard was granting these waivers, the publishers started refusing to accept the riders wholesale, claiming that the scientist could just request a waiver. So the publishers tend to win.

I wrote a response to his post, clarifying some apparent misconceptions about the policy, but it was too long for his blogging platform's comment system, so I decided to post it here in its entirety. Here it is:

There's a lot to like about your post, and I agree with much of what you say. But I'd like to clarify some specific issues about the Harvard open-access policies, which are in place at seven of the Harvard schools as well as MIT, Duke, Stanford, and elsewhere.

The policy has two aspects. First, the policy commits faculty to (as you say) "submitting the work to the archive", that is, providing a copy of the final manuscript of each article, to be deposited into Harvard's DASH open-access repository. Doing so involves filling out a web form with metadata about the article and uploading a file. But if that is too much trouble, we provide a simpler web form that is tantamount to just uploading the file. Or you can email the file to the OSC. Or one of our "open-access fellows" can make the deposit on your behalf. We also harvest articles from other repositories such as PubMed Central and arXiv. I can't imagine that providing the articles is "a huge pain".

Second, by virtue of the policy, Harvard faculty grant a nonexclusive transferable license to the university in all our scholarly articles. This license occurs as soon as copyright vests in the article, so it predates and therefore dominates any later transfer of copyright to a publisher. Since the policy license is transferable, the university can and does transfer it back to the author, so the author automatically retains rights in each article, without having to take any further action. Because of this policy, the "legwork of securing the rights" is actually eliminated. By doing nothing at all, the author retains rights in the article.

You mention attaching a rider to publication agreements. Although we provide an addendum generator to generate such riders, and we recommend that authors use them, attaching an addendum is not required to retain rights. The only point of the addendum is to alert the publisher that the author has already given Harvard non-exclusive rights to the article (though publishers undoubtedly are already aware of the fact; the policy and its license have been widely publicized).

Because we want the policy to work in the interest of faculty and guarantee the free choice of faculty as to the disposition of their works, the license is waivable at the sole discretion of the author. Thus, rights retention moves from an opt-in regime without the policy to an opt-out regime with the policy. The waiver aspect of the policy was not a response to publisher pushback, but has in fact been in the policies from the beginning. The waiver was intended to preserve complete freedom of choice for authors in rights retention.

As is found in many areas (organ donation, 401K participation), participation tends to be much higher with opt-out than opt-in systems, and that holds for rights retention as well. We have found that the waiver rate is extraordinarily low, contra your assumption. For FAS, we estimate it at perhaps 5% of articles. In total, the number of waivers we have issued is in the very low hundreds, out of the many thousands of articles that have been published by Harvard faculty since the policy was in force. MIT has tracked the waiver rate more accurately, and has reported a 1.5% waiver rate. So for well over 90% of articles, authors are retaining broad rights to use their articles.

The statement that "Many, many publishers have pushed back on this" is false. Less than a handful of publishers have established systematic policies to require waivers of the license, which accounts for the exceptionally low waiver rate. Indeed, over a third of all waivers are attributable to a single journal.

The Harvard approach to rights retention and open-access provision for articles is not a silver bullet to solve all problems in scholarly publishing. It has a limited goal: to provide an alternate venue for openly disseminating our articles and to retain the rights to do so. It is extremely successful at that goal. Many thousands of articles have been deposited in DASH, accounting for over half a million downloads. Nonetheless, other efforts need to be made to address the underlying market dysfunction in scholarly publishing, and we are actively engaged there too. For those interested in what we're up to along those lines, I recommend taking a look at the various posts at my blog, The Occasional Pamphlet, which discusses issues of open access and scholarly communication more generally.

Conan Doyle on the prevention of cruelty to books

01 Dec 2011

I've been reading Arthur Conan Doyle's first novel, The Narrative of John Smith, just published for the first time by the British Library. It's no The Adventures of Sherlock Holmes, that's for sure. For one thing, he seems to have left out any semblance of plot. But it does incorporate some entertaining pronouncements. Here's one I identify with highly:

There should be a Society for the Prevention of Cruelty to Books. I hate to see the poor patient things knocked about and disfigured. A book is a mummified soul embalmed in morocco leather and printer's ink instead of cerecloths and unguents. It is the concentrated essence of a man. Poor Horatius Flaccus has turned to an impalpable powder by this time, but there is his very spirit stuck like a fly in amber, in that brown-backed volume in the corner. A line of books should make a man subdued and reverent. If he cannot learn to treat them with becoming decency he should be forced.

If a bibliophile House of Commons were to pass a 'Bill for the better preservation of books' we should have paragraphs of this sort under the headings of 'Police Intelligence' in the newspapers of the year 2000: 'Marylebone Police Court. Brutal outrage upon an Elzevir Virgil. James Brown, a savage-looking elderly man, was charged with a cowardly attack upon a copy of Virgil's poems issued by the Elzevir press. Police Constable Jones deposed that on Tuesday evening about seven o'clock some of the neighbours complained to him of the prisoner's conduct. He saw him sitting at an open window with the book in front of him which he was dog-earing, thumb-marking and otherwise ill using. Prisoner expressed the greatest surprise upon being arrested. John Robinson, librarian of the casualty section of the British Museum, deposed to the book, having been brought in in a condition which could only have arisen from extreme violence. It was dog-eared in thirty-one places, page forty-six was suffering from a clean cut four inches long, and the whole volume was a mass of pencil — and finger — marks. Prisoner, on being asked for his defence, remarked that the book was his own and that he might do what he liked with it. Magistrate: "Nothing of the kind, sir! Your wife and children are your own but the law does not allow you to ill treat them! I shall decree a judicial separation between the Virgil and yourself: and condemn you to a week's hard labour." Prisoner was removed, protesting. The book is doing well and will soon be able to quit the museum.'

Portrait of Arthur Conan Doyle by Sidney Paget, c. 1890

What a wonderful, wonderful thing it is, though use has dulled our admiration of it! Here are all these dead men lurking inside my oaken case, ready to come out and talk to me whenever I may desire it. Do I wish philosophy? Here are Aristotle, Plato, Bacon, Kant and Descartes, all ready to confide to one their very inmost thoughts upon a subject which they have made their own. Am I dreamy and poetical? Out come Heine and Shelley and Goethe and Keats with all their wealth of harmony and imagination. Or am I in need of amusement on the long winter evenings? You have but to light your reading lamp and beckon to any one of the world's great storytellers, and the dead man will come forth and prattle to you by the hour. That reading-lamp is the real Aladdin's wonder for summoning the genii with. Indeed, the dead are such good company that one is apt to think too little of the living.

I know that there are those who think it is a sign of appreciation to write in, dog-ear, underline, highlight, and otherwise modify books — Anne Fadiman lauds such things as carnal acts — but I can't bring myself to do so. I just can't.

How should funding agencies pay open-access fees?

16 Nov 2011

...a drop in the bucket. Drop I (2007) by Delox - Martin Deák via flickr. Used by permission (CC by-nc-nd) — “...a drop in the bucket.”
*Drop I* (2007) by Delox - Martin Deák via flickr. Used by permission (CC by-nc-nd)

At the recent Berlin 9 conference, there was much talk about the role of funding agencies in open-access publication, both through funding-agency-operated journals like the new eLife journal and through direct reimbursement of publication fees. I've written in the past about the importance of universities underwriting open-access publication fees, but only tangentially about the role of funding agencies. To correct that oversight, I provide in this post my thoughts on how best to organize a funding agency's open-access underwriting system.

The motivation for underwriting publication fees is simple: Publishers provide valuable services to authors: management of peer review; production (copy-editing and typesetting); filtering, branding, and imprimatur. Although access to scholarly articles can now be provided at essentially zero marginal cost through digital networks, some means for paying for these so-called first-copy costs needs to be found in order to preserve these services. The natural business model is the open-access journal funded by article processing fees. (Although most current open-access journals charge no article processing fees, I will abuse the term "open-access journal" for this model.) Open-access (OA) journals are no longer an oddity, a fringe phenomenon. The largest scholarly journal on earth, PLoS ONE, is an OA journal. Major publishers — Springer, Elsevier, SAGE, Nature Publishing Group — are now publishing OA journals.

However, OA journals are currently at a significant disadvantage with respect to subscription journals, because universities and funding agencies subsidize the costs of subscription journals in such a way that authors do not need to trade off money used for the subsidy against money used for other purchases. In particular, subscription fees are paid by universities through their library budgets and by funding agencies through their overhead payments that fund those libraries. Authors do not see, let alone attend to, these costs. In such a situation, an author is inclined to publish in a subscription journal, where they do not need to use any moneys that could otherwise be applied to other uses, rather than an OA journal that requires payment of a publication fee. And if authors are unwilling to publish in open-access journals because of the fees, publishers — even those interested and motivated to switch to an OA revenue model — are unable to do so.

The solution is clear: universities and funding agencies should underwrite reasonable OA publication fees just as they do subscription fees. But how should this be done? Each kind of institution needs to provide its fair share of support.

As I've written about before, universities can underwrite processing fees on behalf of their faculty, and do so in a way that does not reintroduce a moral hazard, by reimbursing faculty for OA publication fees up to a fixed cap per year. Since these funds can only be used for open access fees, they can't be traded off against other purchases, so they don't provide a disincentive against open access journals. On the other hand, since these funds are limited (capped), they provide a market signal to motivate choosing among open access journals so that the economic incentives will militate toward low-cost high-service open access journals.

This is the argument for the Compact for Open-Access Publishing Equity (COPE), a commitment by universities to establish mechanisms for underwriting OA publication fees. COPE has grown well beyond its initial five signatories and is supported by a wide range of institutions and people. Harvard and other COPE signatories have already set up such OA funds, which work in just this way.

Many COPE-compliant OA funds don't underwrite articles that were developed under research grants, under the view that such funding is the responsibility of the granting institutions. COPE calls for universities to do their fair share of paying OA fees, no less, but no more. Funding agencies need to underwrite their share of OA fees as well, and crucially should do so in a way that respects several important criteria:

They level the playing field completely, at least for cost-efficient OA journals.
They recognize that publication of research results often occurs after grants have ended.
They provide incentive for publishers to switch revenue model to the OA publication fee model, or at least provide no disincentive.
They avoid the moral hazard of insulating authors from the costs of their publishing.
They don't place an undue burden on funders that would require reducing the impact of research they fund.

Of course, many funders already allow grantees to pay for OA publication fees from their grants. But this method falls afoul of some of these criteria. With respect to criterion (1), grantees are forced to trade off uses of grant moneys to pay OA fees against uses to pay for other research expenses, providing incentive to publish in subscription-fee journals where these costs are hidden. This approach maintains the tilted playing field against OA journals. With respect to criterion (2), because the funds must be expended during the granting period, grantees must predict ahead of time how many articles they will be publishing in OA journals, where they will be publishing them, and those articles must be completed and accepted for publication by the end of the granting period.

The mechanism that satisfies these criteria is for funding agencies to provide non-fungible funds specifically for OA publication fees, funds that are not usable for purchasing other grant-related materials. Funders would establish a policy that grantees could be reimbursed for OA publication fees for articles based on grant-funded research at any time during or after the period of the grant. This satisfies criterion (1) because grantees would no longer have to pay publication fees out of pocket or from grant funds that could be used otherwise. It satisfies criterion (2) because payments can be provided after the end of the grant. (If desired, the delay after the grant ends can be limited to, say, a year or two.) A reasonable requirement for reimbursement of publication fees would be that the article explicitly acknowledge the grant as a source of research funding.

Wellcome Trust already uses a similar incremental funding system. However, they (inadvisably in my mind) allow the funds to apply to so-called hybrid publication fees, where an additional fee can be paid to make a single article available open access. These reimbursements should be limited to publication fees for true OA journals, not hybrid fees for subscription journals. Willingness to pay hybrid fees provides an incentive for a publisher to maintain the subscription revenue model for a journal, because the publisher can acquire these funds without converting the journal as a whole to open access. Eschewing hybrid fees is necessary to satisfy criterion (3).

If funders were willing to pay arbitrary amounts for publication fees without limit, a new moral hazard would be introduced into the publishing market. Authors would become price-insensitive and hyperinflation of publication fees would be possible. To retain a functioning market in publication fees, we must be careful in designing the reimbursement scheme for OA journals; we need to make sure that there is still some scarce resource that authors must manage. This can be achieved in a couple of ways, by capping reimbursements or by copayments. First, reimbursement of OA publication fees can be offered only up to a fixed percentage of the grant amount. By way of example, if an average NIH grant is $300,000 (excluding overhead^[1]), a cap of, say, 2% would provide up to $6,000 available for OA fees. (Robert Kiley, Head of Digital Services at the Wellcome Trust, estimates that at present rates all funded papers of the Wellcome Trust could be underwritten for about 1.25% of their total granted funds. In the short run, nowhere near that level of underwriting is necessary, since the number of publication-fee-charging OA journals is so small. In the long run, as competition in the publication fee market increases, this number may well go down.) That would cover two PLoS Biology papers, three BMC papers, four or five PLoS ONE papers, eight or so Hindawi papers. A grantee would apply separately for these funds to reimburse reasonable OA fees. Some grantees might use all of these funds, some none, with most falling in the middle (and currently at the low end); but in any case they would not be usable for other purposes. Since these funds can only be used for OA publication fees, they can't be traded off against other purchases, so there is no disincentive against selecting OA journals. On the other hand, since these funds are limited (capped), they provide a market signal to motivate choosing among open access journals so that the economic incentives will militate toward low-cost high-service OA journals. (This can't be repeated often enough.)

Alternatively, a copayment approach can be used to provide economic pressure to keep publication fees down. Reimbursement would cover only part of the fee, at least at the expensive end of such fees. It is important (criterion 1) that for cost-efficient OA journals, authors should not be out of pocket for any fees. Thus, reimbursement should be at 100% for journals charging less than some threshold amount, say, $1,500. (As publishers become more efficient, this threshold can and should be reduced over time.) Above that level, the funder might pay only a proportion of the fee, say, 50%, so that grantees have some "skin in the game" and are motivated to trade off publication fees against quality of publisher services. With these parameters, the payment schedule would provide for the following kinds of payments:

Publication fee	Funder pays	Author copays	Examples
$700	$700	$0	typical Hindawi journal, SAGE Open
$1350	$1350	$0	PLoS ONE, Scientific Reports
$2000	$1750	$250	typical BMC journal
$2900	$2200	$700	PLoS Biology

(What the right parameters of such an approach are may depend on field and may change over time. I don't propose these as the correct values, but merely provide an example of the workings of such a system.)

These two approaches are complementary. A policy could involve both a per-article copayment and a maximum per-grant outlay.

Finally, criterion (5) calls for implementing such an underwriting scheme as cost-effectively as possible, so that a funder's research impact is not lessened by paying for publication fees. Indeed, one might expect that impact would be increased by such a move, given that the tiny percentage of funds going to OA fees would mean that those research results were freely and openly available to readers and to machine analysis throughout the world. I would think (and I recall a claim to this effect at Berlin 9) that the impact benefit of providing open access to a funder's research results is greater than the impact of the marginal funded research grant. To the extent that this is so, it behooves funders to underwrite OA fees even at the expense of funding the incremental research. Nonetheless, there may be no need to forego funding research just to pay OA fees. Suppose that on the average grant incremental funds of $200 are used to pay OA publication fees. (With current availability and usage of OA journals, this is likely an overestimate of current demand for OA fees.) Where would this money come from? To the extent that faculty are publishing in OA journals, funders should not need to underwrite subscription journals, so that their overhead rates can be reduced accordingly. An overhead rate of 67% (Harvard's current rate) would need to be reduced by a minuscule 0.067% to compensate. (This is not a typo. The number really is 0.067%, not 6.7%.) This constitutes a percentage reduction in overhead of one part in a thousand, a drop in the bucket. In the longer term over several years if usage of the funds rises to, say, $1000 per grant, the overhead rate would need to be reduced by a still tiny 0.33% for cost neutrality. As more OA journals become available and more funds are used, the overhead rate would be adjusted accordingly. If hypothetically all journals became OA, and all articles incurred these charges, the cost per grant might rise higher to Wellcome Trust's predicted 1.25% (though by this point competition may have substantially reduced the fees), but then, larger reductions in overhead rates would be met by reduced university costs, since libraries would no longer need to pay subscription fees.

One of the nice properties of this approach is that it doesn't require synchronization of the many actors involved. Each funding agency can unilaterally start providing OA fee reimbursement along these lines. Until a critical mass do so, the costs would be minimal. Once a critical mass is obtained, and journals feel confident enough that a sufficient proportion of their author pool will be covered by such a fund to switch to an open-access revenue model, subscription fees to libraries will drop, allowing for overhead rates to be reduced commensurately to cover the increasing underwriting costs. Each actor — author, funder, publisher, university, library — acts independently, with a market mechanism to move all towards a system based on open access.

It is time for funding agencies to take on the responsibility not only to fund research but its optimal distribution. Part of that responsibility is putting in place an economically sustainable system of underwriting open-access publication fees.

^[1]The NIH Data Book reports average grant size for 2010 as around $450,000, which corresponds to something like $270,000 assuming a 67% overhead rate. $300,000 is thus likely on the high side.

The future of the library, expressed in sculpture

13 Oct 2011

Petrus Spronk, “Architectural Fragment”, 1992. Photo © 2005 Robert Laddish, used by permission. — Petrus Spronk, “Architectural Fragment”, 1992. Photo © 2005 Robert Laddish (www.laddish.net), used by permission.

I've just been at the conference in honor of the 30th anniversary of the University of Sao Paulo Integrated Library System (SIBi USP). David Palmer, one of the speakers at the conference, used in his presentation a picture of a wonderful sculpture that I had never seen before, which turned out to be a public art piece at the State Library of Victoria in Melbourne, Australia by Petrus Spronk entitled "Architectural Fragment". I place a couple of pictures of it here in honor of Spronk's 72nd birthday, which happens to be today. You can find more images here.

Petrus Spronk, "Architectural Fragment", 1992. Photo by flickr user madam3181, used by permission (CC by-nc-nd).

Tales of peer review, episode 1: Boyer and Moore's MJRTY algorithm

23 Sep 2011

I'm generally a big fan of peer review. I think it plays an important role in the improvement and "chromatography" of the scholarly literature. But sometimes. Sometimes.

The Boyer-Moore MJRTY algorithm allows efficient determination of which shape (triangle, circle, square) is in the majority without counting each shape. — The Boyer-Moore MJRTY algorithm allows efficient determination of which shape (triangle, circle, square) is in the majority *without counting each shape*.

This past week I was reading Robert Boyer and J Strother Moore's paper on computing the majority element of a multiset, which presents a very clever simple algorithm for this fundamental problem and a description of a mechanical proof of its correctness. The authors aptly consider the work a "minor landmark in the development of formal verification and automated reasoning".

Below is the postscript to that paper, in its entirety, which describes the history of the paper including how and why it was "repeatedly rejected for publication". (It was eventually published as a chapter in a 1991 festschrift for Woody Bledsoe, ten years after it was written, and is now also available from Moore's website.)

In this paper we have described a linear time majority vote algorithm and discussed the mechanically checked correctness proof of a Fortran implementation of it. This work has a rather convoluted history which we would here like to clarify.

The algorithm described here was invented in 1980 while we worked at SRI International. A colleague at SRI, working on fault tolerance, was trying to specify some algorithms using the logic supported by "Boyer-Moore Theorem Prover." He asked us for an elegant definition within that logic of the notion of the majority element of a list. Our answer to this challenge was the recursive expression of the algorithm described here.

In late 1980, we wrote a Fortran version of the algorithm and proved it correct mechanically. In February, 1981, we wrote this paper, describing that work. In our minds the paper was noteworthy because it simultaneously announced an interesting new algorithm and offered a mechanically checked correctness proof. We submitted the paper for publication.

In 1981 we moved to the University of Texas. Jay Misra, a colleague at UT, heard our presentation of the algorithm to an NSF site-visit team. According to Misra (private communication, 1990): "I wondered how to generalize [the algorithm] to detect elements that occur more than n/k times, for all k, k ≥ 2. I developed algorithm 2 [given in Section 3 of [9]] which is directly inspired by your algorithm. Also, I showed that this algorithm is optimal [Section 5, op. cit.]. On a visit to Cornell, I showed all this to David Gries; he was inspired enough to contribute algorithm 1 [Section 2, op. cit.]." In 1982, Misra and Gries published their work [9], citing our technical report appropriately as "submitted for publication."

However, our paper was repeatedly rejected for publication, largely because of its emphasis on Fortran and mechanical verification. A rewritten version emphasizing the algorithm itself was rejected on the grounds that the work was superceded by the paper of Misra and Gries!

When we were invited to contribute to the Bledsoe festschrift we decided to use the opportunity to put our original paper into the literature. We still think of this as a minor landmark in the development of formal verification and automated reasoning: here for the first time a new algorithm is presented along with its mechanically checked correctness proof—eleven years after the work.

I have to think the world would have been better off if Boyer and Moore had just posted the paper to the web in 1981 and been done with it. Unfortunately, the web hadn't been developed yet.

Subscription fees as a distribution control mechanism

11 Sep 2011

Stamps to mark "restricted data" (modified from "atomic stamps 1" by flickr user donovanbeeson, used by permission under CC by-nc-sa)

Ten years ago today was the largest terrorist action in United States history, an event that highlighted the importance of intelligence, and its reliance on information classification and control, for the defense of the country. This anniversary precipitated Peter Suber's important message, which starts from the fact that access to knowledge is not always a good. He addresses the question of whether open access to the scholarly literature might make information too freely available to actors who do not have the best interests of the United States (or your country here) at heart. Do we really want everyone on earth to have information about public-key cryptosystems or exothermic chemical reactions? Should our foreign business competitors freely reap the fruits of research that American taxpayers funded? He says,

You might think that no one would seriously argue that using prices to restrict access to knowledge would contribute to a country's national and economic security. But a vice president of the Association of American Publishers made that argument in 2006. He "rejected the idea that the government should mandate that taxpayer financed research should be open to the public, saying he could not see how it was in the national interest. 'Remember -- you're talking about free online access to the world,' he said. 'You are talking about making our competitive research available to foreign governments and corporations.' "

Suber's response is that "If we're willing to restrict knowledge for good people in order to restrict knowledge for bad people, at least when the risks of harm are sufficiently high, then we already have a classification system to do this." (He provides a more detailed response in an earlier newsletter.) He is exactly right. Placing a $30 paywall in front of everyone to read an article in order to keep terrorists from having access to it is both ineffective (relying on al Qaeda's coffers to drop below the $30 point is not a counterterrorism strategy) and overreaching (since a side effect is to disenfranchise the overwhelming majority of human beings who are not enemies of the state). Instead, research that the country deems too dangerous to distribute should be, and is, classified, and therefore kept from both open access and toll access journals.

This argument against open access, that it might inadvertently abet competitors of the state, is an instance of a more general worry about open distribution being too broad. Another instance is the "corporate free-riding" argument. It is argued that moving to an open-access framework for journals would be a windfall to corporations (the canonical example is big pharma) who would no longer have to subscribe to journals to gain the benefit of their knowledge and would thus be free-riding. To which the natural response would be "and what exactly is wrong with that?" Scientists do research to benefit society, and corporate use of the fruits of the research is one of those benefits. Indeed, making research results freely available is a much fairer system, since it allows businesses both large and small to avail themselves of the results. Why should only businesses with deep pockets be able to take advantage of research, much of which is funded by the government.

But shouldn't companies pay their fair share for these results? Who could argue with that? To assume that the subscription fees that companies pay constitute their fair share for research requires several implicit assumptions that bear examination.

Assumption 1: Corporate subscriptions are a nontrivial sum. Do corporate subscriptions constitute a significant fraction of journal revenues? Unfortunately, there are to my knowledge no reliable data on the degree to which corporate subscriptions contribute to revenue. Estimates range from 0% (certainly the case in most fields of research outside the life sciences and technology) to 15-17% to 25% (a figure that has appeared informally and been challenged in favor of a 5-10% figure). (Thanks to Peter Suber for help in finding these references.) None of these estimates were backed up in any way. Without any well-founded figures, it doesn't seem reasonable to be worrying about the issue. The onus is on those proposing corporate free-riding as a major problem to provide some kind of transparently supportable figures.

Assumption 2: Corporations would pay less under open access. The argument assumes that in an open-access world, journal revenues from corporations would drop, because they would save money on subscriptions but would not be supporting publication of articles through publication fees. That is, corporate researchers "read more than they write." Of course, corporate researchers publish in the scholarly literature as well (as I did for the first part of my career when I was a researcher at SRI International), and thus would be contributing to the financial support of the publishing ecology. Here again, I know of no data on the percentage of articles with corporate authors and how that compares to the percentage of revenue from corporate subscriptions.

Assumption 3: Corporations shouldn't be paying less than they now are, perhaps for reasons of justice, or perhaps on the more mercenary basis of financial reality. It is presumed that if corporations are not paying subscription fees (and, again by assumption, publication fees) then academia will have to pick up the slack through commensurately higher publication fees, so the total expenditure by academia will be higher. This is taken to be a bad thing, but the reason for that is not clear. Why is it assumed that the "right" apportionment of fees between academia and business is whatever we happen to have at the moment, resulting as it does from historical happenstance based on differential subscription rates and corporate and university budget decisions? Free riding in the objectionable sense is to get something without paying when one ought to pay. But the latter condition doesn't apply to the open-access scholarly literature any more than it applies to broadcast television.

Assumption 4: Corporations only support research through subscription fees. However, corporations also provide support for funded research through the corporate taxes that they pay to the government, which funds the research. And this mode of payment has the advantage that it covers all parts of the research process, not just the small percentage that constitutes the publishing of the final results. Corporate taxes constitute some 10% of total US tax revenue according to the IRS, so we can impute corporate underwriting of US-government funded research at that same 10% level. (In fact, since many non-corporate taxes, like FICA taxes, are earmarked for particular programs that don't fund research, the imputed percentage should perhaps be even higher.) The subscription fees companies pay is above and beyond that. Is the corporate 10% not already a fair share? Might it even be too much?

If we collectively thought that the amount corporations are paying is insufficient, then the right response would be to increase the corporate taxes accordingly, so that all corporations contribute to the underwriting of scientific research that they all would be benefitting from. Let's take a look at some numbers. The revenue from the 2.5 million US corporations paying corporate tax for 2009 (the last year for which data are available) was about $225 billion. The NSF budget for 2009 was $5.4 billion. So, for instance, a 50% increase in the NSF budget would require increasing corporate tax revenues by a little over 1%, that is, from a 35% corporate tax rate (say) to something like 35.4%. I'm not advocating an increase in corporate taxes for this purpose. First, I'm in no way convinced that corporations aren't already supporting research sufficiently. Second, there are many other effects of corporate taxes that may militate against raising them. Instead, the point is that it is naive to pick out a single revenue source, subscription fees, as the sum total of corporate support of research.

Assumption 5: Subscription fees actually pay for research, or some pertinent aspect of research. But those fees do not devolve to the researchers or cover any aspect of the research process except for the publication aspect, and publishing constitutes only a small part of the costs of doing research. To avoid disingenuousness, shouldn't anyone worrying about whether corporations are doing their fair share in underwriting that aspect be worrying about whether they are doing their fair share in underwriting the other aspects as well? Of course, corporations arguably are underwriting other aspects — through internal research groups, grants to universities and research labs, and their corporate taxes (the 10% discussed above). And in an open-access world, they would be covering the publication aspect as well, namely publication fees, through those same streams.

In summary, maintaining the subscription revenue model for reasons of distribution control — whether for purposes of state defense or corporate free-riding — is a misconstruction.

JSTOR opens access to out-of-copyright articles

08 Sep 2011

Cover of the first issue of the Philosophical Transactions of the Royal Society, dated March 6, 1665. Available from JSTOR's Early Journal Content collection.

JSTOR, the non-profit online journal distributor, announced yesterday that they would be making pre-1923 US articles and pre-1870 non-US articles available for free in a program they call "Early Journal Content". The chosen dates are not random of course; they guarantee that the articles have fallen out of copyright, so such distribution does not run into rights issues. Nonetheless, that doesn't mean that JSTOR could take this action unilaterally. JSTOR is further bound by agreements with the publishers who provided the journals for scanning, which may have precluded them contractually from distributing even public domain materials that were derived from the provided originals. Thus such a program presumably requires cooperation of the journal publishers. In addition, JSTOR requires goodwill from publishers for all of its activities, so unilateral action could have been problematic for its long-run viability. (Such considerations may even in part underly JSTOR's not including all public domain material in the opened collection.)

Arranging for the necessary permissions — whether legal or pro forma — takes time, and JSTOR claims that work towards the opening of these materials started "about a year ago", that is, prior to the recent notorious illicit download program that I have posted about previously. Predictably, the Twittersphere is full of speculation about whether the actions by Aaron Swartz affected the Early Journal Content program:

@grimmelm: JSTOR makes pre-1923 journals freely available http://about.jstor.org/participate-jstor/individuals/early-journal-content Would this have happened earlier or later w/o @aaronsw?

@mecredis: JSTOR makes all their public domain content available for free: http://about.jstor.org/news-events/news/jstor%E2%80%93free-access-early-journal-content I think this means @aaronsw wins.

@maxkaiser: Breaking: @JSTOR to provide free #openaccess to pre-1923 content in US & pre-1870 elsewhere - @aaronsw case had impact: http://about.jstor.org/news-events/news/jstor%E2%80%93free-access-early-journal-content

@JoshRosenau: JSTOR "working on releasing pre-1923 content before [@aaronsw released lotsa their PDFs], inaccurate to say these events had no impact."

@mariabustillos: Stuff that in yr. pipe and smoke it, JSTOR haters!! http://bit.ly/qtrxdV Also: how now, @aaronsw?

So, did Aaron Swartz's efforts affect the existence of JSTOR's new program or its timing? As to the former, it seems clear that with or without his actions, JSTOR was already on track to provide open access to out-of-copyright materials. As to the latter, JSTOR says that

[I]t would be inaccurate to say that these events have had no impact on our planning. We considered whether to delay or accelerate this action, largely out of concern that people might draw incorrect conclusions about our motivations. In the end, we decided to press ahead with our plans to make the Early Journal Content available, which we believe is in the best interest of our library and publisher partners, and students, scholars, and researchers everywhere.

On its face, the statement implies that JSTOR acted essentially without change, but we'll never know if Swartz's efforts sped up or slowed down the release.

What the Early Journal Content program does show is JSTOR's interest in providing broader access to the scholarly literature, a goal they share with open-access advocates, and even with Aaron Swartz. I hope and expect that JSTOR will continue to push, and even more aggressively, towards broader access to its collection. The scholarly community will be watching.

On guerrilla open access

28 Jul 2011

[Update January 13, 2013: See my post following Aaron Swartz's tragic suicide.]

Aaron Swartz has been indicted for wire fraud, computer fraud, unlawfully obtaining information from a protected computer, and recklessly damaging a protected computer. The alleged activities that led to this indictment were his downloading massive numbers of articles from JSTOR by circumventing IP and MAC address limitations and breaking and entering into restricted areas of the MIT campus to obtain direct access to the network, for the presumed intended purpose of distributing the articles through open file-sharing networks. The allegation is in keeping with his previous calls for achieving open access to the scholarly literature by what he called "guerrilla open access" in a 2008 manifesto: "We need to download scientific journals and upload them to file sharing networks." Because many theorize that Swartz was intending to further the goals of open access by these activities, some people have asked my opinion of his alleged activities.

Before going further, I must present the necessary disclaimers: I am not a lawyer. He is presumed innocent until proven guilty. We don't know if the allegations in the indictment are true, though I haven't seen much in the way of denials (as opposed to apologetics). We don't know what his intentions were or what he planned to do with the millions of articles he downloaded, though none of the potential explanations I've heard make much sense even in their own terms other than the guerrilla OA theory implicit in the indictment. So there is a lot we don't know, which is typical for a pretrial case. But for the purpose of discussion, let's assume that the allegations in the indictment are true and his intention was to provide guerrilla OA to the articles. (Of course, if the allegations are false, as some seem to believe, then my claims below are vacuous. If the claims in the indictment turn out to be false, or colored by other mitigating facts, I for one would be pleased. But I can only go by what I have read in the papers and the indictment.)

There's a lot of silliness that has been expressed on both sides of this case. The pro-Swartz faction is quoted as saying "Aaron's prosecution undermines academic inquiry and democratic principles." Hunh? Or this one: "It's incredible that the government would try to lock someone up for allegedly looking up articles at a library." Swartz could, of course, have looked up any JSTOR article he wanted using his Harvard library privileges, and could even have text-mined the entire collection through JSTOR's Data for Research program, but that's not what he did. Or this howler: "It's like trying to put someone in jail for allegedly checking too many books out of the library." No, it isn't, and even a cursory reading of the indictment reveals why. On the anti-Swartz side, the district attorney says things like "Stealing is stealing whether you use a computer command or a crowbar, and whether you take documents, data or dollars." If you can't see a difference between, say, posting one of your articles on your website and lifting a neighbor's stereo, well then I don't know what. There's lots of hyperbole going on on both sides.

Here's my view: Insofar as his intentions were to further the goals of proponents of open access (and no one is more of a proponent than I), the techniques he chose to employ were, to quote Dennis Blair, "not moral, legal, or effective."

If the claims in the indictment are true, his actions certainly were not legal. The simple act of downloading the articles en masse was undoubtedly a gross violation of the JSTOR terms and conditions of use, which would have been incorporated into the agreement Swartz had entered into as a guest user of the MIT network. Then there is the breaking and entering, the denial of service attack on JSTOR shutting down its servers, the closing of MIT access to JSTOR. The indictment is itself a compendium of the illegalities that Swartz is alleged to have committed.

One could try to make an argument that, though illegal, the acts were justified on moral grounds as an act of civil disobedience, as Swartz says in his manifesto. "There is no justice in following unjust laws. It’s time to come into the light and, in the grand tradition of civil disobedience, declare our opposition to this private theft of public culture." If this was his intention, he certainly made an odd choice of target. JSTOR is not itself a publisher "blinded by greed", or a publisher of any sort. It merely aggregates material published by others. As a nonprofit organization founded by academics and supported by foundations, its mission has been to "vastly improve access to scholarly papers", by providing online access to articles previously unavailable, and at subscription rates that are extraordinarily economical. It has in fact made good on that mission, for which I and many other OA proponents strongly support it. This is the exemplar of Swartz's villains, his "[l]arge corporations ... blinded by greed"? God knows there's plenty of greed to go around in large corporations, including large commercial publishing houses running 30% profit margins, but you won't find it at JSTOR. As a side effect of Swartz's activities, large portions of the MIT community were denied access to JSTOR for several days as JSTOR blocked the MIT IP address block in an attempt to shut Swartz's downloads down, and JSTOR users worldwide may have been affected by Swartz's bringing down several JSTOR servers. In all, his activities reduced access to the very articles he hoped to open, vitiating his moral imperative. And if it is "time to come into the light", why the concerted active measures to cover his tracks (using the MIT network instead of the access he had through his Harvard library privileges, obscuring his face when entering the networking closet, and the like)?

Finally, and most importantly, this kind of action is ineffective. As Peter Suber predicted in a trenchant post that we can now see as prescient, it merely has the effect of tying the legitimate, sensible, economically rational, and academically preferable approach of open access to memes of copyright violation, illegality, and naiveté. There are already sufficient attempts to inappropriately perform this kind of tying; we needn't provide further ammunition. Unfortunate but completely predictable statements like "It is disappointing to see advocates of OA treat this person as some kind of hero." tar those who pursue open access with the immorality and illegality that self-proclaimed guerrillas exhibit. In so doing, guerrilla OA is not only ineffective, but counterproductive.

I believe, as I expect Aaron Swartz does, that we need an editorially sound, economically sustainable, and openly accessible scholarly communication system. We certainly do not have that now. But moving to such a system requires thoughtful efforts, not guerilla stunts.

C'est la bouquet, or why translation is hard

18 Jul 2011

I used to use as my standard example of why translation is hard — and why fully automatic high-quality translation (FAHQT) is unlikely in our lifetimes however old we are — the translation of the first word of the first sentence of the first book of Proust's Remembrance of Things Past. The example isn't mine. Brown et al. cite a 1988 New York Times article about the then-new translation by Richard Howard. Howard chose to translate the first word of the work, longtemps, as time and again (rather than, for example, the phrase for a long time as in the standard Moncrieff translation) so that the first word time would resonate with the temporal aspect of the last word of the last volume, temps, some 3000 pages later. How's that for context?

I now have a new example, from the Lorin Stein translation of Grégoire Bouillier's The Mystery Guest. Stein adds a translator's note to the front matter “For reasons the reader will understand, I have refrained from translating the expression ‘C’est le bouquet.’ It means, more or less, ‘That takes the cake.’” That phrase occurs on page 14 in the edition I'm reading.

The fascinating thing is that the reader does understand, fully and completely, why the translator chose this route. But the reason is, more or less, because of a sentence that occurs on page 83, a sentence that shares no words with the idiom in question. True the protagonist perseverates on this latter sentence for the rest of the novella, but still, I challenge anyone to give an explanation in less than totally abstract terms, as far from the words actually used as you can imagine, to explain the reasoning, perfectly clear to any reader, of why the translator made this crucial decision.

Language is ineffable.

The NIH responds to my letter

07 Jun 2011

Front steps of National Library of Medicine, 2008, photo courtesy of NIH Image Bank

Imagine my surprise when I actually received a response to my letters in recognition of the NIH public access policy, a form letter undoubtedly, but nonetheless gratefully received. And as a side effect, it allows us to gauge the understanding of the issues in the pertinent offices.

The letter, which I've duplicated below in its entirety, addresses two of the issues that I raised in my letter, the expansion of the policy to other agencies and the desirability for a reduction in the embargo period.

With regard to expanding the NIH policy to other funding agencies, the response merely notes the America COMPETES Act's charge to establish a working group to study the matter — fine as far as it goes, but not an indication of support for expansion itself.

With regard to the embargo issue, the response seems a bit confused as to how things work in the real world. Let's look at some sentences from the pertinent paragraph:

"As you may know, the 12-month delay period specified by law (Division G, Title II, Section 218 of P.L. 110-161) is an upper limit. Rights holders (sometimes the author, and sometimes they transfer some or all of these rights to publishers) are free to select a shorter delay period, and many do." This is of course true. My hope, and that of many others, is to decrease this maximum.
"The length of the delay period is determined through negotiation between authors and publishers as part of the copyright transfer process." Well, not so much. Authors don't so much negotiate with publishers as just sign whatever publishers put in their path. When one actually attempts to engage in negotiation, sadly rare among academic authors, things often go smoothly, but sometimes take a turn for the odd, and authors in the thrall of publish or perish are short on negotiating leverage.
"These negotiations can be challenging for authors, and our guidance (http://publicaccess.nih.gov/FAQ.htm#778) encourages authors to consult with their institutions when they have questions about copyright transfer agreements." I have a feeling that the word challenging is a euphemism for something else, but I'm not sure what. The cited FAQ doesn't in fact provide guidance on negotiation, but just language to incorporate into a publisher agreement to make it consistent with the 12-month embargo. No advice on what to do if the publisher refuses, much less how to negotiate shorter embargoes. As for the excellent advice to "consult with their institutions", in the case of Harvard, that kind of means to talk with my office, doesn't it? Which, I suppose, is a vote of confidence.

So there is some room for improvement in understanding the dynamic at play in author-publisher relations, but overall, I'm gratified that NIH folks are on top of this issue and making a good faith effort to bring the fruits of research to the scholarly community and the public at large, and reiterate my strong support of NIH's policy.

Here's the full text of the letter:

DEPARTMENT OF HEALTH & HUMAN SERVICES
Public Health Service
National Institutes of Health
Bethesda, Maryland 20892

May 27 2011

Stuart M. Shieber, Ph.D.
Welch Professor of Computer Science, and
Director, Office for Scholarly Communication
1341 Massachusetts Avenue
Cambridge, Massachusetts 02138

Dear Dr. Shieber:

Thank you for your letters to Secretary Sebelius and Dr. Collins regarding the NIH Public Access Policy. I am the program manager for the Policy, and have been asked to respond to you directly.

We view the policy as an important tool for ensuring that as many Americans as possible benefit from the public's investment in research through NIH.

I appreciate your suggestions about reducing the delay period between publication and availability of a paper on PubMed Central. As you may know, the 12-month delay period specified by law (Division G, Title II, Section 218 of P.L. 110-161) is an upper limit. Rights holders (sometimes the author, and sometimes they transfer some or all of these rights to publishers) are free to select a shorter delay period, and many do. The length of the delay period is determined through negotiation between authors and publishers as part of the copyright transfer process. These negotiations can be challenging for authors, and our guidance (http://publicaccess.nih.gov/FAQ.htm#778) encourages authors to consult with their institutions when they have questions about copyright transfer agreements.

I also appreciate your suggestion to expand this Policy to other Federal science funders, and the confidence it implies in our approach. The National Science and Technology Council (NSTC) has been charged by the America COMPETES Reauthorization Act of 2010 (P.L. 111-358) to establish a working group to explore the dissemination and stewardship of peer reviewed papers arising from Federal research funding. I am copying Dr. Celeste Rohlfing at the Office of Science and Technology Policy on this correspondence, as she is coordinating the NSTC efforts on Public Access.

Sincerely,

Neil M. Thakur, Ph.D.
Special Assistant to the NIH Deputy Director for Extramural Research

cc: Ms. Celeste M. Rohlfing
Assistant Director for Physical Sciences
Office of Science and Technology Policy
Executive Office of the President
725 17th Street, Room 5228
Washington, DC 20502

The benefits of copyediting

04 Jun 2011

dictionary and red pencil — Dictionary and red pencil, photo by novii, on Flickr

Sanford Thatcher has written a valuable, if anecdotal, analysis of some papers residing on Harvard’s DASH repository (Copyediting’s Role in an Open-Access World, Against the Grain, volume 23, number 2, April 2011, pages 30-34), in an effort to get at the differences between author manuscripts and the corresponding published versions that have benefited from copyediting.

“What may we conclude from this analysis?” he asks. “By and large, the copyediting did not result in any major improvements of the manuscripts as they appear at the DASH site.” He finds that “the vast majority of changes made were for the sake of enforcing a house formatting style and cleaning up a variety of inconsistencies and infelicities, none of which reached into the substance of the writing or affected the meaning other than by adding a bit more clarity here and there” and expects therefore that the DASH versions are “good enough” for many scholarly and educational uses.

Although more substantive errors did occur in the articles he examined, especially in the area of citation and quotation accuracy, they were typically carried over to the published versions as well. He notes that “These are just the kinds of errors that are seldom caught by copyeditors.”

One issue that goes unmentioned in the column is the occasional introduction of errors by the typesetting and copyediting process itself. This used to happen with great frequency in the bad old days when publishers rekeyed papers to typeset them. It was especially problematic in fields like my own, in which papers tend to have large amounts of mathematical notation, which the typesetting staff had little clue about the niceties of. These days more and more journals allow authors to submit LaTeX source for their articles, which the publisher merely applies the house style file to. This practice has been a tremendous boon to the accuracy and typesetting quality of mathematical articles. Still, copyediting can introduce substantive errors in the process. Here’s a nice example from a paper in the Communications of the ACM:

“Besides getting more data, faster, we also now use much more sophisticated learning algorithms. For instance, algorithms based on logistic regression and that support vector machines can reduce by half the amount of spam that evades filtering, compared to Naive Bayes.” (Joshua Goodman, Gordon V. Cormack, and David Heckerman, Spam and the ongoing battle for the inbox, Communications of the Association for Computing Machinery, volume 50, number 2, 2007, page 27. Emphasis added.)

Any computer scientist would immediately see that the sentence as published makes no sense. There is no such thing as a “vector machine” and in any case algorithms don’t support them. My guess is that the author manuscript had the sentence “For instance, algorithms based on logistic regression and support vector machines can reduce by half...” — without the word that. The copyeditor apparently didn’t realize that the noun phrase support vector machine is a term of art in the machine learning literature; the word support was not intended to be a verb here. (Do a Google search for vector machine. Every hit has the phrase in the context of the term support vector machine, at least for the pages I looked at before boredom set in.)

Presumably, the authors didn’t catch the error introduced by the copyeditor. The occurrence of errors of this sort is no argument against copyediting, but it does demonstrate that it should be viewed as a collaborative activity between copyeditors and authors, and better tools for collaboratively vetting changes would surely be helpful.

In any case, back to Dr. Thatcher's DASH study. Ellen Duranceau at MIT Libraries News views the study as “support for the MIT faculty’s approach to sharing their articles through their Open Access Policy”, and the same could be said for Harvard as well. However, before we declare victory, it’s worth noting that Dr. Thatcher did find differences between the versions, and in general the edits were beneficial.

The title of Dr. Thatcher’s column gets at the subtext of his conclusions, that in an open-access world, we’d have to live with whatever errors copyediting would have caught, since we’d be reading uncopyedited manuscripts. But open-access journals can and do provide copyediting as one of their services, and to the extent that doing so improves the quality of the articles they publish and thus the imprimatur of the journal, it has a secondary benefit to the journal of improving its brand and its attractiveness to authors.

I admit that I’m a bit of a grammar nerd (with what I think is a nuanced view that manages to be linguistically descriptivist and editorially prescriptivist at the same time) and so I think that copyediting can have substantial value. (My own writing was probably most improved by Savel Kliachko, an outstanding editor at my first employer SRI International.) To my mind, the question is how to provide editing services in a rational way. Given that the costs of copyediting are independent of the number of accesses, and that the value accrues in large part to the author (by making him or her look like less of a halfwit for exhibiting “inconsistencies and infelicities” and occasionally more substantive errors), it seems reasonable that authors ought to pay publishers a fee for these services. And that is exactly what happens in open-access journals. Authors can decide if the bargain is a good one on the basis of the services that the publisher provides, including copyediting, relative to the fee the publisher charges. As a result, publishers are given incentive to provide the best services for the dollar. A good deal all around.

Most importantly, in a world of open-access journals the issue of divergence between author manuscripts and publisher versions disappears, since readers are no longer denied access to the definitive published version. Dr. Thatcher concludes that the benefits of copyediting were not as large as he would have thought. Nonetheless, however limited the benefits might be, properly viewed those benefits argue for open access.

The Scouring of the White Horse

18 Apr 2011

The owld White Harse wants zettin to rights
And the Squire hev promised good cheer,
Zo we'll gee un a scrape to kip un in zhape,
And a'll last for many a year.

— Thomas Hughes, The Scouring of the White Horse, 1859

On a recent trip to London, I had an extra day free, and decided to visit the Uffington White Horse with a friend. The Uffington White Horse is one of the most mysterious human artifacts on the planet. In the south of Oxfordshire, less than two hours west of London by Zipcar, it sits atop White Horse Hill in the Vale of White Horse to which it gives its name. It is the oldest of the English chalk figures, which are constructed by removing turf and topsoil to reveal the chalk layer below.

The Uffington White Horse, photo by flickr user superdove, used by permission

The figure is sui generis in its magnificence, far surpassing any of the other hill figures extant in England. The surrounding landscape — with its steep hills, the neighboring Roman earthworks castle, and pastoral lands still used for grazing sheep and cows — is spectacular.

The Uffington horse is probably best known for its appearance in Thomas Hughes’s 1857 novel Tom Brown's Schooldays. The protagonist Tom Brown, like Hughes himself, hails from Uffington, and Hughes uses that fact as an excuse to spend a few pages detailing the then-prevalent theory of the origin of the figure, proposed by Francis Wise in 1738, that the figure was carved into the hill in honor of King Æthelred’s victory over the Danes there in 871.^[1]

As it turns out, in a triumph of science over legend, Oxford archaeologists have dated the horse more accurately within the last twenty years. They conclude that the trenches were originally dug some time between 1400 and 600 BCE, making the figure about three millennia old.^[2]

How did the figure get preserved over this incredible expanse of time? The longevity of the horse is especially remarkable given its construction. The construction method is a bit different from its popular presentation as a kind of huge shallow intaglio, revealing the chalk substrate. Instead it is constructed as a set of trenches dug several feet deep and backfilled with chalk. Nonetheless, over time, dirt will overfill the chalk areas and grass will encroach. Over a period of decades, this process leads chalk figures to become "lost". In fact, several lost chalk figures in England are known of.

Chalk figures thus require regular maintenance to prevent overgrowing. Thomas Baskerville^[3] captures the alternatives: "some that dwell hereabout have an obligation upon their lands to repair and cleanse this landmark, or else in time it may turn green like the rest of the hill and be forgotten."

Figure from Hughes's *The Scouring of the White Horse* depicting the 1857 scouring. From the 1859 Macmillan edition.

This "repairing and cleansing" has been traditionally accomplished through semi-regular celebrations, called scourings, occurring at approximately decade intervals, in which the locals came together in a festival atmosphere to clean and repair the chalk lines, at the same time participating in competitions, games, and apparently much beer. Hughes's 1859 book The Scouring of the White Horse is a fictionalized recounting of the 1857 scouring that he attended.^[4]

These days, the regular maintenance of the figure has been taken over by the National Trust, which has also arranged for repair of vandalism damage and even for camouflaging of the figure during World War II.

The author at the Uffington White Horse, 19 March 2011, with Dragon Hill in the background. Note the beginnings of plant growth on the chalk substrate.

Thus, the survival of the Uffington White Horse is witness to a continuous three millennium process of active maintenance of this artifact. As such, it provides a perfect metaphor for the problems of digital preservation. (Ah, finally, I get to the connection with the topic at hand.) We have no precedent for long-term preservation of interpretable digital objects. Unlike books printed on acid-free paper, which survive quite well in a context of benign neglect, but quite like the White Horse, bits degrade over time. It requires a constant process of maintenance and repair — mirroring,^[5] verification, correction, format migration — to maintain interpretable bits over time scales longer than technology-change cycles. By coincidence, those time scales are about commensurate with the time scales for chalk figure loss, on the order of decades.

The tale of the Uffington White Horse provides some happy evidence that humanity can, when sufficiently motivated to establish appropriate institutions, maintain this kind of active process over millennia, but also serves as a reminder of the kind of loss we might see in the absence of such a process. The figure is to my knowledge the oldest extant human artifact that has survived due to continual maintenance. In recognition of this, I propose that we adopt as an appropriate term for the regular processes of digital preservation "the scouring of the White Horse".

[A shout out to the publican at Uffington's Fox and Hounds Pub for the lunch and view of White Horse Hill after our visit to the horse.]

^[1]Francis Wise, A Letter to Dr. Mead concerning some antiquities in Berkshire; Particularly shewing that the White Horse, which gives name to the Vale, is a Monument of the West Saxons, made in memory of great Victory obtained over the Danes A.D. 871, 1758.

^[2]David Miles and Simon Palmer, "White Horse Hill," Current Archaeology, volume 142, pages 372-378, 1995.

^[3]Thomas Baskerville, The Description of Towns, on the Road from Faringdon to Bristow and Other Places, 1681.

^[4]One of the salutary byproducts of the recent mass book digitization efforts is the open availability of digital versions of both Hughes books: through Open Library and Google Books.

^[5]Interestingly, the Uffington White Horse has been "mirrored" as well, with replicas in Hogansville, GA, Juarez, Mexico, and Canberra, Australia.

Newer Older

The Occasional Pamphlet ...on scholarly communication