Monday 29 July 2013

Why digital humanists should get out of textual scholarship. And if they don't, why we textual scholars should throw them out.

Three weeks ago, when I was writing my paper for the conference on Social, Digital, Scholarly Editing I organized (with lots of help -- thanks guys!) at Saskatoon, I found myself writing this sentence:

    Digital humanists should get out of textual scholarship: and if they will not, textual scholars 
     should throw them out.

As I wrote it, I thought: is that saying more than I mean?  Perhaps -- but it seemed to me something worth saying, so I left it in, and said it, and there was lots of discussion, and some misunderstanding too.  Rather encouraged by this, I said it again the next week, this time at the ADHO conference in Nebraska. (The whole SDSE paper, as delivered, is now on; the slides from Nebraska are at slideshare -- as the second half of a paper, following my timeless thoughts on what a scholarly digital edition should be).

So, here are a few more thoughts on the relationship between the digital humanities and textual scholarship, following the discussion which the two papers provoked, and numerous conversations with various folk along the way.

First, what I most definitely did not mean.  I do not propose that textual scholars should reject the digital world, go back to print, it's all been a horrible mistake, etc. Quite the reverse, in fact. Textual scholarship is about communication, and even more than other disciplines, must change as how we communicate changes. The core of my argument is that the digital turn is so crucial to textual scholars that we have to absorb it totally -- we have to be completely aware of all its implications for what we do, how we do it, and who we are.  We have to do this, ourselves. We cannot delegate this responsibility to anyone else.  To me, too much of what has gone on at the interface between textual scholarship and the digital humanities over the last few years has been exactly this delegation.  There were good reasons for this delegation over the first two decades (roughly) of the making of digital editions. The technology was raw; digital humanists and scholarly editors had to discover what could be done and how to do it.  The prevailing model for this engagement was: one scholar, one project, one digital humanist, hereafter 1S/1P/1DH.  Of course there are many variants on this basic pattern.  Often, the one digital humanist was a team of digital humanists, typically working out of a single centre, or was a fraction of a person, or the one scholar might be a group of scholars.  But, the basic pattern of close, long and intensive collaboration between the 'one scholar' and the 'one digital humanist' persists. This is how I worked with Prue Shaw on the Commedia and Monarchia; with Chris Given-Wilson and his team on the Parliament Rolls of Medieval England; and how many other digital editions were made at King's College London, MITH, IATH, and elsewhere.

This leads me to my second point, which has (perhaps) been even more misunderstood than the first point. I am not saying, at all, that because of the work done up to now, all the problems have been solved, we have all the tools we need, we can now cut the ties between textual scholarship and digital humanities and sail the textual scholarly ship off into the sunset, unburdened by all those pesky computer folk. I am saying that this mode of collaboration between textual scholars and digital humanists, as described in the last paragraph, has served its purpose.  It did produce wonderful things, it did lead to a thorough understanding of the medium and what could be done with it. However, there are such problems with this model that it is not just that it is not needed: we should abandon it for all but a very few cases. The first danger, as I have suggested, is that it leads to textual scholars relying over-much on their digital humanist partners. I enjoyed, immensely, the privilege of two decades of work with Prue Shaw on her editions of Dante.  Yet I feel, in looking back (and I know Prue will agree) that too many times, I said to her -- we should do this, this way; or we cannot do that.  I think these would have been better editions if Prue herself had been making more decisions, and me fewer (or even none).  As an instance of this: Martin Foys' edition of the Bayeux Tapestry seems to me actually far better as an edition, in terms of its presentation of Martin's arguments about the Tapestry, and his mediation of the Tapestry, than anything else I have published or worked on.  And this was because this really is Martin's edition: he conceived it, he was involved in every detail, he thought long and hard about exactly how and what it would communicate.  (Of course, Martin did not do all this by himself, and of course he relied heavily on computer engineers and designers -- but he worked directly with them, and not through the filter of a 'digital humanist', i.e., me). And the readers agree: ten years on, this is still one of the best-selling of all digital editions.

The second danger of this model, and one which has already done considerable damage, is that the digital humanist partners in this model come to think that they understand more about textual editing than they actually do -- and, what is worse, the textual editors come to think that the digital humanists know more than they do, too. A rather too-perfect example of this is what is now chapter 11 of the TEI guidelines (in the P5 version).  The chapter heading is "Representation of Primary Sources", and the opening paragraphs suggest that the encoding in this chapter is to be used for all primary source materials: manuscripts, printed materials, monumental inscriptions, anything at all.  Now, it happens that the encoding described in this chapter was originally devised to serve a very small editorial community, those engaged in the making of "genetic editions", typically of the draft manuscripts of modern authors. In these editions, the single document is all-important, and the editor's role is to present what he or she thinks is happening in the document, in terms of its writing process. In this context, it is quite reasonable to present an encoding optimized for that purpose. But what is not at all reasonable is to presume that this encoding should apply to every kind of primary source.  When we transcribe a manuscript of the Commedia, we are not just interested in exactly how the text is disposed on the page and how the scribe changed it: we are interested in the text as an instance of the work we know as the Commedia.  Accordingly, for our editions, we must encode not just the "genetic" text of each page: we need to encode the text as being of the Commedia, according to canticle, canto and line. And this is true for the great majority of transcriptions of primary sources: we are encoding not just the document, but the instance of the work also.  
Indeed, it is perfectly possible to encode both document and work instance in the one transcription, and many TEI transcriptions do this.  For the TEI to suggest that one should use a model of transcription developed for a small (though important) fraction of editorial contexts for all primary sources, the great majority of which require a different model, is a mistake.
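As a minimal sketch of that last point, the fragment below shows one transcription carrying both the document (page and line breaks) and the work instance (canticle, canto, line), read back with Python's standard library. It is an illustration under stated assumptions, not a recommended encoding: the markup is a simplified, namespace-free imitation of TEI, the folio numbers and the position of the line break are invented, and the helper names `work_view` and `document_view` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified TEI-like fragment (no namespace). Folio numbers "1r"/"1v" and the
# position of <lb/> are invented for illustration only.
SAMPLE = """
<text>
  <body>
    <div type="canticle" n="Inferno">
      <div type="canto" n="1">
        <pb n="1r"/>
        <l n="1">Nel mezzo del cammin<lb/> di nostra vita</l>
        <l n="2">mi ritrovai per una selva oscura</l>
        <pb n="1v"/>
        <l n="3">ché la diritta via era smarrita</l>
      </div>
    </div>
  </body>
</text>
"""


def work_view(root):
    """Read the text as an instance of the work: (canticle, canto, line, text)."""
    rows = []
    for canticle in root.iter("div"):
        if canticle.get("type") != "canticle":
            continue
        for canto in canticle.findall("div[@type='canto']"):
            for line in canto.findall("l"):
                # itertext() joins the text on either side of an embedded <lb/>.
                text = "".join(line.itertext())
                rows.append((canticle.get("n"), canto.get("n"), line.get("n"), text))
    return rows


def document_view(root):
    """Read the same text as a document: which verse lines begin on which page."""
    pages = {}
    current = None
    for element in root.iter():  # iter() walks the tree in document order
        if element.tag == "pb":
            current = element.get("n")
            pages[current] = []
        elif element.tag == "l" and current is not None:
            pages[current].append(element.get("n"))
    return pages


root = ET.fromstring(SAMPLE)
for row in work_view(root):
    print(row)
print(document_view(root))
```

Both views come from the same file: nothing about recording the page forces the editor to give up the canticle, canto and line structure, or the reverse.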

Another instance of this hubris is the preoccupation with TEI encoding as the ground for scholarly editing. Scholarly editors in the digital age must know many, many things.  They must know how texts are constructed, both as document and work instance; they must know how they were transmitted, altered, transformed; they must know who the readers are, and how to communicate the texts and what they know of them using all the possibilities of the digital medium. What an editor does not need to know is exactly what TEI encoding should be used at any point, any more than editors in the print age needed to know what variety of linotype printer was in use.  While the TEI hegemony has created a pleasant industry in teaching TEI workshops, the effect has been to mystify the editorial process, convincing rather too many prospective editors that this is just too difficult for them to do without -- guess what -- a digital humanist specialist.  This, in turn, has fed what I see as the single most damaging product of the continuation of the 1S/1P/1DH model: that it disenfranchises all those scholars who would make a digital edition, but do not have access to a digital humanist.  As this is almost every textual scholar there is, we are left with very few digital editions. This has to change. Indeed, multiple efforts are being made towards this change, as many groups are attempting to make tools which (at least in theory) might empower individual editors.  We are not there yet, but we can expect in the next few years a healthy competition as new tools appear.

A final reason why the 1S/1P/1DH model must die is the most brutal of all: it is just too expensive. A rather small part of the Codex Sinaiticus project, the transcription and alignment of the manuscripts, consumed grant funding of around £275,000; the whole project cost many times more.  Few editions can warrant this expenditure -- and as digital editions and editing lose their primary buzz, funding will decrease, not increase.  Throw in another factor: almost all editions made this way are data siloes, with the information in them locked up inside their own usually-unique interface, and entirely dependent on the digital humanities partner for continued existence.

In his post in response to the slides of my Nebraska talk, Joris van Zundert speaks of "comfort zones". The dominance of the 1S/1P/1DH model, and the fortunate streams of funding sustaining that model, has made a large comfort zone. The large digital humanities centres have grown in part because of this model and the money it has brought them -- and have turned the creation of expensively-made data, dependent on them for support, into a rationale for their own continued existence. What is bad for everyone else -- a culture where individual scholars can make digital editions only with extraordinary support -- is good for them, as the only people able to provide that support.  I've written elsewhere about the need to move away from the domination of digital humanities by a few large centres (in my contribution to the proceedings of last year's inaugural Australian Association of Digital Humanities conference).

This comfort zone is already crumbling, as comfort zones tend to do.  But beside the defects of the 1S/1P/1DH model, a better reason for its demise is that a better model exists, and we are moving towards it.  Under this model, editions in digital form will be made by many people, using a range of online and other tools which will permit them to make high-quality scholarly editions without having to email a digital humanist every two minutes (or ever, even). There will be many such editions.  But we will have gained nothing if we lock up these many editions in their own interfaces, as so many of us are now doing, and if we wall up the data by non-commercial or other restrictive licenses.

This is why I am at such pains to emphasize the need for this new generation of editions to adopt the creative commons attribution share-alike licence, and to make all created materials available independent of any one interface, as the third and fourth desiderata I list for scholarly editions in the digital age. The availability of all this data, richly marked up according to TEI rules and supporting many more uses than the 'plain text' (or 'wiki markup') transcripts characteristic of the first phase of online editing tools, will fuel a burgeoning community of developers, hacker/scholars, interface creators, digital explorers of every kind. I expressed this in my Nebraska talk this way:

Under this model we can look to many more digital humanists working with textual scholarly materials, and many more textual scholars using digital tools. There will still be cases where the textual scholar and the digital humanist work closely together, as they have done under the 1S/1P/1DH model, in the few scholarly edition projects which are of such size and importance as to warrant and fund their own digital support.  (I hope that Troy Griffitts has a long and happy time ahead of him, supporting the great editions of the Greek New Testament coming from Münster and Birmingham). But these instances will not be the dominant mode in which digital humanists and textual scholars work together. At heart, the 1S/1P/1DH model is inherently restrictive. Only a few licensed people can work with the data made from any one edition.  Instead, as Joris says, we should seek to unlock the "highly intellectually creative and prolific" potential of the digital environment, by allowing everyone to work with what we make.  In turn, this will fuel the making of more and better tools, which textual scholars can then use to make more and better editions, in a truly virtuous circle.

Perhaps I overdramatized matters, by using a formula suggesting that digital humanists should no longer have anything to do with textual scholarship, when I meant something different: that the model of how digital humanists work with textual scholars should change -- and is changing.  I think it is changing for the better. But to ensure that it does, we should recognize that the change is necessary, work with it rather than against it, and determine just what we would like to see happen.  It would help enormously if the large digital humanities centres, and the agencies which fund them, subscribed whole-heartedly to my third and fourth principles: of open data, available through open APIs.  The first is a matter of will; the second requires resources, but those resources are not unreasonable.  I think that it will be very much in the interests of the centres to adopt these rules.  Rather quickly, this will exponentially increase the amount of good data available to others to use, and hence incite others to create more, and in turn increase the real need for centres specializing in making the tools and other resources textual scholars need.  So, more textual scholars, more digital humanists, everyone wins.


  1. I agree wholeheartedly with the spirit of what you are saying, Peter. But technically this is not going to work. You can't use TEI-XML data with its embedded interpretations and complex markup as the basis of a collaborative edition for everyone. If the past 25 years have taught us anything it is that. Try re-reading the original grant application for the TEI from 1988. It says there that a common encoding format will lead to the development of software that utilises it and "the texts may be used with any software package", i.e. that the texts will be interoperable. That did not happen, and no one else is claiming that now. What Syd Bauman said at Balisage 2011 is that interoperability of TEI-encoded data is impossible. He should know: he was co-author of the P5 Guidelines. And Martin Mueller said that
    "TEI encoded texts right now offer no advantage over plain text, HTML or epub texts.... But what about the added value of TEI specific encoding for the historian, linguist, philosopher, literary critic etc? How can they decode or get at it, and what does it do for them? The answer is that for the most part they cannot get at it at all."
    And when he said that he was chairman of the TEI board.

  2. Desmond, I agree entirely *if* one has to work with the 'raw' XML, using tools like Oxygen, etc., and then make your own publishing system, etc. But we see all around us on the web systems using HTML5 -- which is really no less complex than TEI-lite, at the least -- performing miracles of interoperability. Facebook, airline and hotel booking search engines, Google maps, your newspaper online: look at the source of any of these and its horrendous complexity is way beyond anything the TEI has dreamed up. It's all to do with the tools, and the way they integrate with each other, and with the systems they use. A critical part of this is that it is not enough to rely on XML and XML alone: all those systems I just named use a mass of methods (databases, especially) to store and move data about. We need tools and systems which will work with each other, and with all the derivatives of XML, to do what we want. I think in fact the direction things have gone in the last years has actually been the enemy of this development. Most of the TEI XML made in the last decades lives with the people who made it (typically, a digital humanities centre). Freeing all this data up would take us a huge step towards the kinds of interoperability we need (and see everywhere else).

  3. Peter, there's a big difference between the markup systems you mention, such as HTML5, which is an interoperable standard, and TEI-XML, which isn't. And there's a big difference between getting machines to encode information into a fixed database structure or into HTML5 and asking humans to encode features they see in manuscripts or printed books. The first is mediated by a calculating machine, the second by the human brain. And the result is, predictably, entirely different, and must always be. Humans interpret what they see; machines don't.
    Let me give you an example. When I supervised students and professional encoders for our Wittgenstein edition in 1992-2001 I got the set of tags we used down to a single page of instructions. I trained each encoder, and inspected their work. Some of them were with us for years. And you know what? No one, from the director down to the most recent student helper, encoded the same features in the same way. Even the same encoder would use different tags to record the same features on different days. The only way we could ensure that the encoding could be processed by one application consistently was for me to go over everything and make it so.
    Even something as mundane as italics or a title can be encoded in myriad ways. Take a look at what Patrick Durusau said in Electronic Textual Editing: one sentence in a printed book transcribed in TEI can be encoded in 4 million different ways. The chance that any two people will accidentally hit upon the same encoding, and that this encoding will just happen to be what the software expects, is practically zero. He gives the example of searching for "titles" in a corpus of texts prepared by different people: only a small percentage of the "titles" actually turn up in the search because they were encoded inconsistently. For 25 years we have been encoding texts in TEI-XML and the interoperability we originally sought hasn't happened. And I don't see that it is going to happen now merely as a result of calling for it yet again.
    The complexity issue is enveloped in this problem too. Firstly there are so many TEI tags that finding out if there is already one for your particular feature is pretty hard. If you can't find it you'll likely invent a new one, or misuse an existing one. Secondly there is no way to simplify the encoding to the point where your average member of the public or a novice XML encoding humanist can handle it. Lots of people have tried to produce nice simple interfaces for TEI by mapping tags to formats. But not all TEI features map to formats. What about variants? What about links, joins etc? You might be able to map a specific simple schema to a simple HTML format but you can't handle the general case. So again, I think this invalidates the idea of collaborative _TEI_ editions.
    What you're leaving out of the equation is the human element. We can't communicate like machines, and that includes communication via XML tags.
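The point about searching for "titles" can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sentence and the three encoders are invented, the markup is namespace-free, and `find_titles` is a hypothetical stand-in for whatever search a real application performs. The elements themselves (`title`, `hi` with a `rend` attribute, `emph`) are ones the TEI provides.

```python
import xml.etree.ElementTree as ET

# One printed sentence containing an italicised book title, transcribed three
# ways. The sentence and the encoders are invented for illustration.
ENCODINGS = {
    "encoder_a": "<p>She was reading <title>Middlemarch</title> again.</p>",
    "encoder_b": '<p>She was reading <hi rend="italic">Middlemarch</hi> again.</p>',
    "encoder_c": "<p>She was reading <emph>Middlemarch</emph> again.</p>",
}


def find_titles(xml_fragment):
    """Naive search: return the text of every <title> element in the fragment."""
    root = ET.fromstring(xml_fragment)
    return ["".join(t.itertext()) for t in root.iter("title")]


def plain_text(xml_fragment):
    """Strip all markup and return only the transcribed words."""
    return "".join(ET.fromstring(xml_fragment).itertext())


hits = {name: find_titles(xml) for name, xml in ENCODINGS.items()}
found = sum(1 for titles in hits.values() if titles)
print(f"titles found in {found} of {len(ENCODINGS)} transcriptions")

# The words are identical in all three; only the interpretation differs.
print(len({plain_text(xml) for xml in ENCODINGS.values()}) == 1)
```

All three transcriptions are valid and carry the same words, yet a search written against one encoder's habits misses the other two.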

  4. I believe a major factor which keeps TEI encoding so varied is the absence of the killer tools. Let me explain. If humanities scholars had one tool which made their research life 10x easier, which operated on an interpretation of a subset of the TEI guidelines, the user base of this killer tool would have incentive to conform to a common interpretation of the TEI. A second major factor is that we still EXPECT HUMANITIES SCHOLARS TO ENCODE ANYTHING! When you say, give me a TEI editor, most people think of an Eclipse plugin to assist them in the markup. This is a sad state of affairs. A true WYSIWYG editor which mechanically produces TEI from user editing will drastically standardize a project's markup. For example:

    1. Sorry about the late response to this, but I only just saw it.
      With respect to the papyrological editor you point to it's very nice, though it is quite specific to the dialect of EpiDoc it understands. But a general TEI editor (not just a plain XML one) that can handle the variety of ways in which people understand and use the great variety of codes they need for each individual project, and which truly responds to the textual structures those codes describe, remains elusive. As you point out it is customisability that is the key. But have you ever reflected that constructing a general encoding scheme might actually be a bad idea? It forces everyone to implement codes they don't want. The way TEI is divided into tag and attribute groups everything overlaps with everything else. What people who have used Roma to do this have reported to me (and among them a TEI board member) is that you end up dragging along at least half of TEI every time you want to create a sub-dialect of it. And TEI is not just tags that create nice simple HTML formats. It's lots of other things that require programming to make them come alive. This means that you have to expose the user to the source code, and if the user then commits a syntax error in this complex hierarchical language you have to handle it right there in the editor. That's not very user friendly. So please excuse me for expressing scepticism that another 24 years of trying to create "killer" TEI tools will have any better result than the last 24 years.
      I believe that the answer lies in removing most of the functionality from TEI that shouldn't be in there, and then using a minimal markup language (MML) for each individual project that is just used to enter the data, then translating the MML into a general interoperable format like HTML, or plain text with external tags (TEXT+STIL) for storage on the server. Then when you want to edit it again, you translate it back to MML. Using this approach all you need is to define half a dozen tags for each project, because most of the complex stuff (metadata, annotations, variants) is handled elsewhere. Then you have a chance to create a killer application, but it won't use TEI.
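The round trip described here (minimal markup in, plain text plus external tags out, and back again for editing) might look roughly like the sketch below. It is only a sketch under loud assumptions: the two-delimiter markup is invented, the dictionary-based range format is a simplification and not the actual STIL format, and the function names are hypothetical.

```python
# Hypothetical per-project tag set: one delimiter character per property.
DELIMITERS = {"*": "italic", "_": "underline"}


def mml_to_standoff(mml):
    """Split minimal markup into plain text plus external (standoff) ranges."""
    text = []
    ranges = []
    open_at = {}  # property name -> offset at which it was opened
    for ch in mml:
        if ch in DELIMITERS:
            name = DELIMITERS[ch]
            if name in open_at:
                start = open_at.pop(name)
                ranges.append({"name": name, "offset": start, "len": len(text) - start})
            else:
                open_at[name] = len(text)
        else:
            text.append(ch)
    if open_at:
        raise ValueError(f"unclosed markup: {sorted(open_at)}")
    return "".join(text), ranges


def standoff_to_mml(text, ranges):
    """Rebuild the minimal markup from plain text and ranges, for re-editing.

    When several delimiters fall at the same offset their relative order is
    not preserved, although the ranges they describe are.
    """
    name_to_delim = {name: delim for delim, name in DELIMITERS.items()}
    inserts = {}  # offset -> delimiters to emit before the character there
    for r in ranges:
        delim = name_to_delim[r["name"]]
        inserts.setdefault(r["offset"], []).append(delim)
        inserts.setdefault(r["offset"] + r["len"], []).append(delim)
    out = []
    for i, ch in enumerate(text):
        out.extend(inserts.get(i, []))
        out.append(ch)
    out.extend(inserts.get(len(text), []))
    return "".join(out)


source = "I read *Middlemarch* in _one_ week."
text, ranges = mml_to_standoff(source)
print(text)
print(ranges)
print(standoff_to_mml(text, ranges) == source)
```

The stored text contains no markup at all, so it can be searched or compared as plain text, while the ranges can be translated to HTML or back to the entry markup as needed.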
