Tuesday, 20 July 2021

Fun with Fonts. Junicode, Unicode, and ꝑ

If you see a character looking like a p with a bar through the descender in the title of this post, and you see it here too, then ... read on. And if you don't, then read on anyway (and let me know!)

Thirty years ago, when I, Tim Berners-Lee, Lou Burnard and the web were all much younger, every "special character" was a challenge, and a potential triumph or failure. "Special" meant anything beyond ASCII 127 (ah, the acronyms!). It meant anything non-English, in the most limited Brexit sense. E-acute was used by people from across the Channel, and a few Canadians, and was not to be attempted without Special Equipment (in those days, a Macintosh computer). Devanagari was a distant dream, and right-to-left writing an impossibility.

Nowadays, thanks to Unicode, and the work of many unsung heroes of font-design, with a special shout-out to those who sat on myriad committees and shepherded the whole process to every smart phone on the planet, we have become so used to everything appearing just right, with no effort at all on our part, that we are in danger of forgetting how many miracles had to occur so that I can insert a ꝑ in my document, and you can see it. (The best miracles are made by people working together, of course). But every now and then, something happens to remind us of how many ducks make a row.

Like many medievalists, I am a fan of Peter Baker's beautiful Junicode font. For years, I have been happily typing ꝑ into transcriptions, Word and PDF documents. This and a few other characters are very common in many medieval vernacular and Latin manuscripts. ꝑ is used as an abbreviation for per or par, as in "person" and "parish", and so is found everywhere in Chaucer manuscripts (think of the Parson and the Pardoner). One of the great joys of Junicode is that it shows this character in a particularly elegant form, appearing as:

Over the years, we have used Junicode in all our work with medieval texts, and have become so accustomed to the daily miracle of Junicode that we don't think about it. It works. "We" is all the people who work on the Canterbury Tales Project and a few other projects -- particularly Dante. I am currently working with various Dante scholars on a new publication, coming soon to a browser near you. Trust me, you will know about this when it happens. So, imagine my surprise when after so many years of trouble-free use, my main collaborator said that our elegant Junicode p with bar appeared as a horrid oversize black character on her computer, thus:

At first, I thought this was just an aberration, something odd about the way her computer was set up. The character appeared fine on my computer, and on various other computers I looked at, but not on hers. Why not? Down the rabbit hole I went.

By this time, we had graduated to bundling the Junicode font with our developing site, so that readers would not have to download the font to their computers. This is a well-documented process: Font Squirrel documents it and provides neat tools to convert any font to a "webfont", easily embeddable in any web page. So I began investigating. On my computer, the character appeared fine:

  • if I had Junicode on my computer, and the font embedded in the page
  • if I had Junicode on my computer, and the font NOT embedded in the page
It did NOT appear fine if I did NOT have Junicode on my computer and had only Junicode embedded in the web page. Yet the web page showed Junicode everywhere else -- but not this character, and a few other characters. How could this be? 

I began digging. The Unicode code point for p with a bar is A751. This is in the "general use" area of Unicode, which major fonts support as a matter of course: so you can paste the ꝑ from this document into a Word document and use it in Times New Roman, Geneva, etc. When I looked at Junicode on my computer, using Apple's Font Book, p with a bar appeared as glyph 2007, Unicode A751, exactly as it should:

However, on my collaborator's computer, the same character appeared in a quite different place: as glyph 2066, Unicode E670 (on my computer, Junicode has a quite different character at glyph 2066). 
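The two code points at issue can be checked directly with Python's standard unicodedata module (a quick sketch of my own, not part of the original investigation):

```python
import unicodedata

# The official, post-Unicode-5.1 code point for p with a bar:
p_bar = "\uA751"
print(hex(ord(p_bar)))               # 0xa751
print(unicodedata.category(p_bar))   # 'Ll': an ordinary lowercase letter

# The old MUFI code point sits in the Basic Multilingual Plane's
# Private Use Area (U+E000..U+F8FF), so its meaning is font-specific:
old = "\uE670"
print(unicodedata.category(old))     # 'Co': Private Use, no standard meaning
```

The 'Co' category is exactly why a Private Use code point travels badly between computers: what it displays as depends entirely on which font (and which version of that font) happens to be installed.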

What is going on? Why is her Junicode different from mine? On digging about, it appears that some time in the past, Junicode indeed had this character at E670. The "E" and "F" Unicode ranges are "Private Use" areas, and it appears that up to the time when p with a bar was allocated A751 in the "general use" area, Junicode put p with a bar in the "private use" area, with that encoding. This is a rather long story, involving a group called the Medieval Unicode Font Initiative (MUFI). One of the aims of this group was to have "core" characters judged essential to scholars working with medieval western European texts incorporated into the "official" Unicode encoding. As of Unicode 5.1, 152 MUFI characters -- among them, p with a bar -- had made it into official Unicode. It appears that my version of Junicode reflects this shift of p with a bar into official, post-5.1, Unicode. The version of Junicode on Prue's computer did not.

More digging. By this time, I was suspecting that the embeddable version of Junicode did not have p with a bar at A751. But why did it display correctly on my computer? It appears that somewhere deep in the innards was an instruction to the effect: if the browser cannot find the character in the embedded font, look elsewhere. So it looked in the Junicode installed on my computer, found the character, and displayed it. It did this even when I tried to fool it by giving the embedded font a different name ("junicoderegular") in the CSS style sheet. However, on my collaborator's computer no font named Junicode supplied the character at A751, and so the browser showed an A751 from another font altogether.
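The behaviour just described can be pictured as per-character fallback down a stack of fonts. Here is a toy model of my own (the font names and code-point sets are invented; this is not browser internals):

```python
# Toy model of per-character font fallback: each "font" is just the set
# of code points it maps. A browser tries each font in the stack until
# one can supply the character, then falls back to a last-resort font.

def render(char, font_stack):
    """Return the name of the first font in the stack covering the character."""
    for name, codepoints in font_stack:
        if ord(char) in codepoints:
            return name
    return "last-resort font (the horrid oversize black character)"

embedded_junicode = {0x0041, 0x0061}          # subsetted webfont: no U+A751!
local_junicode = {0x0041, 0x0061, 0xA751}     # full Junicode installed locally

# My machine: the embedded font lacks the glyph, but local Junicode has it.
print(render("\uA751", [("embedded Junicode", embedded_junicode),
                        ("local Junicode", local_junicode)]))
# My collaborator's machine: no local Junicode to fall back on.
print(render("\uA751", [("embedded Junicode", embedded_junicode)]))
```

The same page thus renders differently on two machines: everything hinges on whether a locally installed font happens to plug the gap in the embedded one.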

Eventually, after scores of emails and hours of digging, I concluded that the root of the problem lay in the embedded font. Somehow, this embedded Junicode did not have p with a bar where it should be. So I set about trying to correct this. First I went to the Font Squirrel generator:

I uploaded the Junicode TTF from my computer, Font Squirrel converted it to a "webfont", and all seemed fine. Nope. Same problem. I dug deeper. I went to Peter Baker's "Junicode" page on Font Squirrel and used the "webfont kit" generator on that page. Nope. Same problem. With increasing desperation, I noticed that the page offered a choice of "subsets":

So, I chose "no subsetting" and created the webfont. And at last, it worked!

All this for characters which appear just five times in some 2,400 pages of manuscript transcription.

This tale casts into relief the many rough edges in the interplay of fonts, glyphs, character code points, Unicode areas, and encoding systems (UTF-8 or UTF-16? BOM or not?), all playing against multiple versions as all of these evolve and agreements are forged and renewed. The wonder is that problems like these occur so rarely.

Tuesday, 13 February 2018

Getting Started with Textual Communities

Welcome to the temporary home of Version 2 of Textual Communities ("TC"), at textcomtest.usask.ca. This address will change when we are ready to go fully public with TC. Until then, this is a sandbox version, and all data may disappear at any point.
If you just want to see what TC can do: choose a community from "Public Communities", and "View".

Sample files

You can get the sample files used in this documentation at www.sd-editions.com/tc. You can download all the files in this directory in a single zipfile at www.sd-editions.com/tc/tcstart.zip

Logging in

Here is what you see:

Press the inviting "Start" button, and you will be asked to log in by social media, or create a log-in using your email address. If you do the latter, you will be sent an email to that address to confirm your registration. (Note: TC uses email addresses to uniquely identify each user).

Creating or joining a community

When you first log in as a new user, the Start button has changed:

The "Create Community" button brings you to this screen:

The two compulsory fields, "Name" and "Abbreviation", are marked with *. Note the accessibility options: you can hide your community from everyone, or allow anyone to do anything, and many options in between.

Your first document: an XML file

Once you have a community, you need documents! The "Start" button at the centre of the screen has changed again:

Choose "Add Document" and you are offered two choices:

This time, select the "XML file" option. TC likes TEI! Here is a very simple example of a TEI/XML file, optimized for TC use:

<?xml version="1.0" ?> 
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<p>Draft for Textual Communities site (spelling modernized)</p>
<sourceDesc><p>Murray McGillivray</p></sourceDesc>
  <pb n="130r" facs="FF130R.JPG"/>
<div n="Book of the Duchess">
        <lb/><head n="Title">The book of the Duchesse</head>
        <lb/><l n="1">I Have great wonder/ be this light</l>
        <lb/><l n="2">How that I live/ for day nor night</l>
        <lb/><l n="3">I may nat slepe/ wel nigh nought</l>
        <lb/><l n="4">I have so many/ an idel thought</l>
        <lb/><l n="5">Purely/ for default of sleep</l>
        <lb/><l n="6">That by my truthe/ I take no keep</l>
        <lb/><l n="7">Of no thing/ how it cometh or goth</l>
        <lb/><l n="8">Ne me is no thing/ leief nor loth</l>
        <lb/><l n="9">Al is y like good / to me</l>
        <lb/><l n="10">Joy or sorrow / where so it be</l>
        <!-- remaining lines of the page omitted -->
</div>
</TEI>

There are a few things to note about this file:
  • "Content" elements with "n" attributes (<l n="1">) are especially important to TC. TC uses these to identify all content sections. Thus: the first line is labelled by TC as "div=Book of the Duchess:l=1", and TC then uses this identifier to locate all versions of the first line in every document
  • Note the explicit use of <lb/> elements to mark each new line in the document. TC uses the implicit hierarchy of page, column and line breaks (<pb/> <cb/> <lb/>) to construct a "text-tree" for each document, alongside the "text-tree" it creates for the hierarchy of <div> and <l> elements.
TC's understanding, that every text is composed of two distinct text-trees, one for the document (<pb/> <lb/> etc) and one for the act of communication represented in the document (<div>, <l> etc), is what separates TC from other systems for creating scholarly editions.
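The two-tree idea can be sketched in a few lines of Python. This is a hypothetical illustration of my own, not TC's actual MongoDB schema: the text is a flat collection of leaves, and each tree independently orders references to those leaves.

```python
# Hypothetical sketch of the two-tree model: leaves of text shared by
# two independent hierarchies (not TC's actual storage schema).

leaves = {
    1: "The book of the Duchesse",
    2: "I Have great wonder/ be this light",
    3: "How that I live/ for day nor night",
}

# Document tree: the physical page, with line breaks in physical order.
document_tree = {"pb": "130r", "lb": [1, 2, 3]}

# Entity tree: the act of communication, a div containing a head and lines.
entity_tree = {
    "div": "Book of the Duchess",
    "children": [("head", "Title", 1), ("l", "1", 2), ("l", "2", 3)],
}

def locate(tree, elem, n):
    """Resolve an identifier like div=Book of the Duchess:l=1 to its text."""
    for tag, num, leaf_id in tree["children"]:
        if tag == elem and num == n:
            return leaves[leaf_id]
    return None

print(locate(entity_tree, "l", "1"))
# -> I Have great wonder/ be this light
```

Because both trees point at the same leaves, TC can show a leaf in its page-by-page context via the document tree, and find the same leaf across many documents via the entity tree.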

Adding more documents, adding images

After selecting "XML file" you will get this dialogue:

Choose the file "Fairfax.xml" from the sample files (see above), give it the name "Ff" (or similar), and press "Load".
You will receive various encouraging messages, and the window should change to show you the sigil for this manuscript in the left hand pane:
Click on the arrow beside Ff to see the pages in Ff, and then click on the first page. Its transcription will now appear in the bottom right pane:

Now, you can add an image to the page. You can do this in several ways:
  • Click on the "Add Image" button in the top-right pane, or the camera icon beside the page number "130r". You will get a box inviting you to choose an image file or drop it onto the dialogue. Choose FF130R.JPG from the sample files.
  • You can load multiple images by putting them all in a folder, zipping the folder, and then clicking on the ZIP icon next to the manuscript name. Choose FairfaxImages.zip from the sample files.
In either case, you will see the image appear in the top right pane. The red camera icon beside each page which now has an image will turn black. If you have all the images for the manuscript, the multiple image icon (two cameras above one another) will also turn black:

Play around with the other icons on this page. Try pressing the "Save", "Preview" and "Commit" buttons to see what happens. (Note: "Commit" will write the page to the underlying database.)
Add another document by clicking on the + icon in the left hand pane. Again, choose the "XML file" option, this time add "Bodley.xml" from the sample files, with the name Bd.


Collation

The power of Textual Communities may be seen in the Collation system. At the top of the left panel, click the "Collation" tab:

In TC terms, an "entity" is a discrete segment of an act of communication: a line of poetry, a paragraph of prose. Click on the arrow beside "Book of the Duchess" to open up the entities (lines of poetry) within it:
(The order of these may vary.) Now, click on one of these lines. You will get this advice:
So, go to that menu:
Choose a base text (it does not matter which). Now, go back to click on line 1 in the collation. The right hand panel will change, to present the wonderful Collation Editor (developed originally for the Greek New Testament editing projects at Münster and Birmingham):
(You may need to make the window larger to see the menu at the bottom of the pane). Spend some time playing with this. You can regularize variants (e.g. remove the variant wonder/wondir) by dropping one word on another:
After choosing "Save", you will see that both manuscripts now have the reading "wonder":
Play with the settings menu. You can change how the collation works from this menu:
You will see how the collation changes as these selections change.
This brief introduction gives only a glimpse of the power of the Collation Editor. Try the following, for example:

  1. Go back to one of the documents, change line 1, commit the change (this writes it to the database used by the collation), and return to the collation. You will see your change there.
  2. Now, for fun: go to the second page of Ff (130v), make line 38 continue from the previous page onto this page, and add something to it. Hint: change the "From previous page" value:

Then, commit this change and return to the collation. You will see that line 38 now includes this extra text, across the page break. You can view the XML for this page by clicking on the XML icon beside the manuscript name, to confirm that the line indeed continues across the page break:

Other facilities

There is a great deal more in TC than this sketch shows. It is particularly rich in community management features, as follows:
  1. You can invite other people to become members of your community (click on the "Members" link when you have chosen your community, or on the "Member profile" item on the log-in menu) and follow the "Invite" link
  2. You can change the status of any member, assign them pages to transcribe, check the progress of the transcription, assign them someone to approve their transcripts (the "Members" link for each community you lead)
  3. You can permit other people to join your community without need of your approval, or require that anyone who wants to join must be approved by you ("Member profile" on the log-in menu)
Further, you can permit anyone to access pages, whole documents, or any part of the text of any document, and import it to their own website.

Copyright, etc.

We encourage anyone contributing materials to TC to make these available under the Creative Commons Attribution (CC-A) license. That is: no share-alike and no "non-commercial" restrictions. This means there should be no restrictions at all, except requiring all subsequent users of the material to acknowledge your part in making it.
For the time being: TC will accept materials which do have restrictions on them. However, it is likely that TC in future will require that all materials held on TC servers are free of all restrictions (CC-A or similar). This is because TC uses University of Saskatchewan and Compute Canada servers. As both are publicly funded, hosting materials with any kind of copyright restrictions raises legal and ethical issues.
If this is a problem for you, you should not use TC.

Some interesting features of TC

Here, in no particular order, are some aspects of TC which make it unusual, even unique:
  • TC is built on an explicit ontology of texts, documents and works. Various of my publications describe this ontology (see https://www.academia.edu/12297061/Some_principles_for_the_making_of_collaborative_scholarly_editions_in_digital_form, https://www.academia.edu/9575974/The_Concept_of_the_Work_in_the_Digital_Age_published_version_, and https://www.academia.edu/3233227/Towards_a_Theory_of_Digital_Editions). Briefly: TC sees a text as a collection of leaves, with all leaves present on two distinct trees, each of which conforms precisely to the "OHCO" (ordered hierarchy of content objects) model. One of the trees represents the document (codex/quires/pages/columns/lines). The other tree represents the act of communication ("entity") inscribed in the document: as Play/Acts/Scenes/Lines, or Poem/Stanzas/Lines, etc. Note that this is not simply a matter of "overlapping hierarchies", as usually characterized. It is actually two quite distinct trees: distinct to the point that branches and their leaves might appear in quite different orders on the two trees (as in the case of notes or alterations spanning the margins of multiple pages). Broadly, TC uses the 'document' tree to display the document page by page, line by line, and the 'entity' tree to locate units of text across multiple documents for collation.
  • XML and all the tools associated with it famously support "one text, one tree". (Long ago, XML's predecessor SGML did attempt to enable multiple trees in any one text through the CONCUR feature. I never did discover a useful implementation of CONCUR.) Over some twenty-five years, I have tried to manipulate the two hierarchies using a variety of tools (most prominently, the Anastasia publishing system). One problem was that for a long time I thought the problem was simply "overlapping hierarchies", and not the more demanding scenario of two distinct trees. Another problem was the inefficiency of XML tools. Accordingly, while TC uses XML as its standard input format, it creates the two distinct trees from the XML and then stores them not as XML but as a series of JSON documents in a MongoDB backend. In essence, the text is a collection of leaves stored in JSON fields, with each leaf also referenced in distinct JSON documents representing the two trees. Over the last decade I have attempted to express this model with three different database systems: first, XML in the form of XML-DB; then SQL in a relational database (underlying the first version of TC, still to be seen at www.textualcommunities.usask.ca); and finally JSON. JSON wins. A key reason for the success of JSON was the requirement that we be able to edit pages in real time: that is, take out a chunk of each tree, rebuild both trees as needed, and then reattach the leaves of text to each rebuilt tree, all while the editor watches. Doing this in real time is like gathering leaves in a howling gale. As a bonus, JSON (much more than XML) is the native language of web content, with an immense range of JavaScript/HTML tools available to process it.
  • Technically: TC is built in pure JavaScript, using node.js and npm tools (https://nodejs.org/en/, https://www.npmjs.com/), for both server and browser components. This makes maintenance far easier. TC also uses the Angular framework to provide all interface components (https://angularjs.org/, drawing on the Bootstrap and jQuery libraries). This architecture was designed by Xiaohan Zhang between 2012 (when we realized that the SQL solution would not work) and 2015. All code is freely available on GitHub, at https://github.com/DigitalResearchCentre/tc.
  • Theoretically: there is no limit to the number of trees structuring every text. TC supports two. Best of British luck to whoever wants to deal with more than two.
  • TC uses an IIIF server and viewer software (http://iiif.io/). In the future, we want to broaden our support for IIIF, to import full IIIF documents, etc.
  • We would like to be obsolete very very soon. Someone please do this better than we did.

Monday, 29 September 2014

The history of Collate

Historical note

This post was first published as part of a series of blogs detailing the move from Collate 2 to CollateX. I reproduce it here in the run-up to the Münster Collation Summit on October 3 and 4, 2014, which might formally mark the final, irrevocable, irredeemable death of Collate 0, 1 and 2.

February 6, 2007
The History of Collate

Filed under: History, Anastasia: finding another — Peter @ 9:29 am

Collate 0 — Collate 1 — Collate 2

There have actually been three versions of Collate, up to now. The very first, Collate 0 if you like, I wrote in Spitbol on the DEC Vax in Oxford between 1986 and 1989. I wrote this to collate 44 manuscripts of the Old Norse narrative sequence Svipdagsmal, which I was editing for my doctoral thesis. I prepared full transcripts of each manuscript on a Macintosh computer, and then transferred them to the Vax (itself, I remember, not so straightforward a task in those days of the floppy disc). I collated the transcripts using the Spitbol program, and created various kinds of output. One of these outputs became the apparatus for the critical edition included in my thesis. Another output was translated into a relational database, which I used to explore the relationships between the manuscripts. To optimize this, information about just what manuscript had what variant in the database was held in a matrix, with rows representing each variant, and columns representing each manuscript. Thus:

1 0 1
1 1 0

showed that manuscripts A and C agree at the first variant (both having variant '1', while B has '0'); manuscripts A and B agree at the second variant (both having variant '1', while C has '0'). This matrix has a historical importance: it was the data given to the participants in the 'textual criticism challenge' of 1991, which established, firstly, that phylogenetic methods were far ahead of any other kinds of analysis in applicability to textual traditions, and, secondly, that phylogenetic analysis could prove genuinely useful in establishing historical relations within a textual tradition.
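The agreements such a matrix encodes can be counted mechanically. Collate 0 itself was written in Spitbol; the following is simply an illustration in Python of what the matrix representation makes easy:

```python
from itertools import combinations

# Rows are variants, columns are manuscripts A, B, C, as in the example above.
matrix = [
    [1, 0, 1],   # variant 1: A and C share reading '1', B has '0'
    [1, 1, 0],   # variant 2: A and B share reading '1', C has '0'
]
manuscripts = ["A", "B", "C"]

# For each pair of manuscripts, count how often they carry the same variant.
agreements = {}
for i, j in combinations(range(len(manuscripts)), 2):
    pair = (manuscripts[i], manuscripts[j])
    agreements[pair] = sum(1 for row in matrix if row[i] == row[j])

print(agreements)
# -> {('A', 'B'): 1, ('A', 'C'): 1, ('B', 'C'): 0}
```

Pairwise agreement counts of exactly this kind are the raw material for exploring manuscript relationships, and the same 0/1 matrix is what phylogenetic programs consume.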
Collate 0 consisted of around 1200 lines of Spitbol code. Spitbol was (and is: versions are still maintained) a rather beautiful language, built around pattern-matching algorithms. It had some very neat string matching and storage facilities (including a nifty table facility with hash and key tools). You could write functions within it, but by modern standards its data model was crude: everything was a string, and that was that. Oxford was then a stronghold of Spitbol (and Snobol) programming: Susan Hockey taught a course in Snobol (I think) and I remember many animated discussions with her and with Lou Burnard about what I was trying to do.

Collate 0 established several approaches to collation which I retained in the later versions of Collate, and which indeed will (I think) be part of CollateXML:
  1. Collation should be based on full transcripts of the manuscripts. This seems obvious now; it was less so then
  2. One should collate all the versions at once, at the same time, rather than (say) running many pair-wise comparisons and then melding the many comparisons into one
  3. The text needed to be divided into collateable blocks. This required some system of marking the blocks: I adopted the COCOA system, then used by the Oxford Concordance Program, for this
  4. Other textual features (notably, abbreviation) needed markup
  5. Some kind of regularization facility was needed to filter out 'spelling' from 'substantive' variation
Collate 0 was successful in two key ways:
  1. I managed to finish my thesis, and got my doctorate, despite spending countless hours, often in the dead of night, deep in Oxford University Computing Services at 6 Banbury Road (and briefly in an OUCS annex in South Parks Road), peering at the green symbols on the darkened Vax terminal, and endlessly tinkering with and re-running the Spitbol program
  2. I wrote two articles for Literary and Linguistic Computing about this work. On the strength of these, and with Susan Hockey’s guidance and help, I submitted a grant application to the Leverhulme Trust to carry on this work.
This grant proposal was successful, and in September 1989 I started work on what became Collate 1. Only one person had ever used, and probably could ever use, Collate 0: me. Rather a lot of computer programs, I have since discovered, are only ever used by the person who wrote them (including, indeed, some made with much public money). Our proposal to the Leverhulme Trust specified that our collation tool could be used by many other people. This meant a real graphical user interface, not the command-line tool which Collate 0 was. Indeed, one needed a graphical interface because I was by then convinced (and I still believe) that scholarly collation is an interactive activity. I found that in Collate 0 I spent endless hours manipulating the collation output by tinkering with the program itself, and by compiling complex regularization tables to smooth out idiosyncratic spellings. This was extremely clumsy. I determined that in Collate 1, we would have the computer make a first guess at the collation for any part of the text, a block at a time. The scholar would examine that collation, and then intervene in a point-and-click way to adjust it as needed. For medieval texts, some form of spelling regularization was required. In Collate 0 the regularizations were held in separate files, which were loaded at runtime: so you had to run the collation, look at the results, see what needed to be changed, open and edit the files (with a line editor, no easy thing), then reload and run the collation again -- and so on. In Collate 1, I wanted to point at what word we wanted regularized to what, and to see the result instantaneously. Similarly, I now knew that any automatic system was going to make decisions about precisely what collated with what which a scholar would find unsatisfactory. Take the collation of 'a cat' against 'cat'.
Should we regard this as replacement of one word by a phrase, or as identity of one word ('cat') in each source and addition of another word ('a') in one source? In Collate 0, such intervention was done in the nastiest possible way: by hardwiring various gotchas into the collation code itself. In Collate 1, this should again be done by some kind of user intervention, working in a graphical user interface.
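The ambiguity is easy to see with any off-the-shelf aligner. As a toy illustration (this is Python's difflib, not Collate's algorithm), a token-level alignment of the two readings must commit to one interpretation; difflib happens to choose the "addition" reading:

```python
from difflib import SequenceMatcher

# Align 'cat' against 'a cat' token by token: is this the replacement of
# one word by a phrase, or 'cat' kept with 'a' added? The aligner must pick.
base, witness = ["cat"], ["a", "cat"]
for op, i1, i2, j1, j2 in SequenceMatcher(None, base, witness).get_opcodes():
    print(op, base[i1:i2], witness[j1:j2])
# insert [] ['a']
# equal ['cat'] ['cat']
```

A scholar might well prefer the other reading in a given context, which is exactly why Collate 1 put this decision under interactive, point-and-click control rather than leaving it to the algorithm.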

This was September 1989, and if you wanted to make a program for personal computers with interactive point-and-click facilities there was only one choice: the Macintosh. Microsoft had attempted two versions of Windows by then, but neither appeared sufficiently stable for a neophyte programmer. By comparison, programmer tools for the Mac were well advanced. Also, I knew Macintosh computers very well, as I had used a succession of Macs for writing my thesis. Apple Computer donated a Macintosh SE (I think) to the project, we purchased a C compiler -- Lightspeed C, which became Think C quite soon -- and we were started. In the early days we did not even have a hard disc. The SE had two floppy disc drives, which made it a truly luxurious machine in those days: you could have the program and some data on one floppy disc, and the operating system and other data on another. Much of the time was spent juggling data and programs between discs, ejecting and inserting disc after disc, sometimes hundreds of times a day (so much so that someone even adapted the pop-up mechanism from a toaster to automate insertion and removal of discs).

The choice of C meant a complete ground-up rewrite of the program, within a windows/icons/menus/pointer (WIMP!) environment. So Collate 1 began, with the first versions released in 1991. This retained the fundamental features of Collate 0 referred to above (collation by blocks based on full transcripts, basic markup) with newer tools: a 'live' collation mode combined with point-and-click adjustment of regularization and setting of variants; expanded and more flexible markup, including notation of layout features such as pages, columns, lines and text ornamentation; and output formatted for TeX processing using the EDMAC macros for complex critical edition layout. In a series of talks in 1990 and 1991 -- at the New Chaucer Society conference in Canterbury; the ALLC conference in Phoenix, Arizona; in Austin, Texas; at Georgetown University in Washington; at the Society for Textual Scholarship in New York; and, especially, at the CHUG meeting in Providence -- I described the unfolding Collate, and recruited its first enthusiastic and hopeful users. Some of these users are still with Collate, many years on: Don Reiman and Neil Fraistat incorporated it into the work they did on their Johns Hopkins Shelley edition; hardly a week since has passed without a message (admonitory, exhortatory, or plain friendly) from Michael Stone; and after fifteen years Prue Shaw was finally able, in 2006, to publish her edition of Dante's Monarchia, built with Collate.
Collate 1 established the user interface still basic to the current Collate 2, which has retained all the major features outlined above. Collate 2 is also built on the same C code as Collate 1. There is no 'clean break' between Collate 1 and 2 as there is between Collate 0 (written in Spitbol) and Collate 1 (written in C) -- and as there will be between the current Collate 2 and its successor (which I now think of as CollateXML, and which I now contemplate will be written in Java, 'now' being January 2007). However, various developments in the early 1990s led to such a drastic reshaping and enlargement of Collate 1 that I came to think of this as 'Collate 2'. These developments, in no special order, were:
  1. The onset of the Text Encoding Initiative. Oxford, through Susan Hockey and Lou Burnard (in those days, the Tony Blair and Gordon Brown of UK humanities computing), was the European leader of the TEI. I found myself drawn into the TEI orbit, even becoming the absurdly underqualified chair of the Scholarly Apparatus workgroup (which included Robin Cover, Ian Lancashire, Bob Kraft and Peter Shillingsburg, so you can see how junior I should have felt). I also attended meetings of the primary source transcription workgroup, though for some reason this has never been recognized in the TEI documentation, and I ended up writing almost the whole of the chapters on textual apparatus and transcription encoding in the TEI (though again, this has never been clearly acknowledged). Through the TEI I learnt about SGML, and became completely convinced that structural markup (though not hierarchical markup) is key to useful scholarly work in the digital age.
  2. The appearance of the web. Oxford was one of the very first sites to mount a web server (as early as late 1992, if I recall rightly) and I attended the first web conference, held at CERN in April 1994, when the web was still small enough for a meeting of server administrators to be held under a tree on the lawn outside the CERN lecture halls.
  3. The development of the Canterbury Tales Project. In our proposal to the Leverhulme Trust we stated that we would use the manuscripts of the Wife of Bath's Prologue as test material. Susan Hockey and I did not think very deeply about this choice: we were just looking for something that was not Old Norse (our other candidate was the Old Norse Solarljod -- and this year, finally, my and Carolyne Larrington's edition of this should appear in the massive new edition of Old Norse skaldic poetry), which was in about the right number of manuscripts, seemed to present interesting problems, and would be fun to work with.
  4. The demands of other Collate users. The key group here was the Institute for New Testament Research in Münster. I first met this group in 1996; in 1997 I started working with them intensively on the Nestle-Aland Greek New Testament, and through them met David Parker and the scholars he was working with in Birmingham.
  5. Collaboration with researchers in evolutionary biology. I had already discovered the power of phylogenetic methods through Robert O’Hara: particularly, his entry to the ‘textual criticism challenge’ in 1991, showing how these methods worked with the Old Norse Svipdagsmal tradition. Robert and I developed this into several articles but were unable to carry it much further. However, in 1996 I met, through Linne Mooney, Chris Howe of the Cambridge University Department of Molecular Biology. As a professional evolutionary biologist, he was able to bring many more resources to this enquiry — particularly, he brought in a series of remarkable individual researchers to the work, each contributing new perspectives.
In different ways, these forced me to refine what Collate did, and to develop new capacities for it, to such an extent that Collate became a new program. The key change was that I came to think that the aim of Collate was not to help scholars prepare print editions, but to help them make electronic editions. This had many consequences. Particularly, it meant that Collate had to prepare materials for inclusion in an electronic edition. This meant first of all SGML — and later, XML and HTML. This meant also extended parsing facilities. I did not go so far as adapting Collate to collate files fully encoded with SGML. Collate now had a body of users with many files encoded in the Collate format and content to go on using that format, and I would have had considerable difficulty persuading them to move over to full SGML. But I did tighten the Collate encoding model to make it closer to SGML, and then added comprehensive facilities to translate Collate encoded files to SGML (and also XML, HTML and other systems). I also folded two full SGML parsers into the program: both Pierre Richard’s YASPMAC and James Clark’s SP. These were used particularly for translating SGML encoded apparatus files into other forms, particularly into NEXUS files for analysis by evolutionary biology programs.

While these extended Collate’s grasp, the requirements of its most demanding users forced it in other directions. One of these demanding users was the Canterbury Tales Project. As we moved onto larger sections of text, and particularly sections where no two manuscripts had the same lines in the same order, I discovered we needed a much more powerful system for dealing with witnesses which had the text blocks in many different orders. ‘Block maps collation’ was, and is, Collate 2’s solution to this. But perhaps the biggest shift of all was one that many users may not see at all. This is the adoption of ‘parallel segmentation collation’, directly as a result of the experience of working with Munster scholars and with evolutionary biologists. I explain at some length exactly how these two groups led us to abandon the ‘base text collation’ we used before 1998 in favour of ‘parallel segmentation collation’ in the article ‘Collation Rationale’ included in the Miller’s Tale CD-ROM.
Adopting this model forced changes on many areas of the program: particularly, on the ‘Set Variants’ module, and also on the kinds of analysis and variant display we could now achieve. Perhaps most of all, it puts us in reach of a yet more sophisticated mode of collation: what I describe as ‘multiple progressive alignment’ in the ‘Collation Rationale’ article. Briefly: once we have aligned the variation across the witnesses into parallel segments, one could then go a step further and analyse the witness groupings within the segments. This is standard practice in analysis of variant DNA sequences in evolutionary biology but I have not implemented this in Collate 2: here, indeed, is a task for the next Collate.

Collate 2 was formally released in 1996, and has been continually refined since then. The development of Collate 1 and 2 now spans over seventeen years, from late 1989 to 2007, and there is C code within Collate dating back to the very beginning of Collate 1. This is an eon in the software world. Further, what was a great benefit in the software world in 1989 — the availability of the Macintosh interface for interface programming — had by 2007 become a cul-de-sac. The introduction of Macintosh OS X from 2000 on rendered the future of Macintosh Classic applications very dubious. I could, in theory, port Collate to OS X and a few times after 2000 I began to experiment with such a port. I discovered, very quickly, that this would be a huge task. The Collate code has grown to around 180 files, amounting to around 120,000 lines of code. Perhaps most discouraging: there are over 80 dialogue windows in Collate, managing the user’s interaction with the program. Some of these — notably, the regularization and set variants windows — have extremely complex execution flows built into them, refined over more than a decade’s experience. One might abandon some of these: but many of these windows would have to be hand-made anew in the OS X environment. Further, OS X changed many aspects of the graphic environment inhabited by Classic, and one would have to go through the code line by line, at some points changing the old for the new. Many of these changes would involve complex reprogramming. And at the end: one would have a program which still ran on only one operating system.

Other things, too, had changed. The mantra of ‘write once, run everywhere’ had taken root, and a new generation of tools (notably, the Java programming environment) had arisen to support this aim. It is now a real possibility to write a complex graphic user interface program which runs identically, and as if native, on multiple platforms. Further, the XML world has matured, with a speed that would seem unimaginable to the very slow pace of development of applications for its predecessor, SGML. And most decisively, perhaps: a model of open-source collaborative programming has developed. All the time that I wrote Collate 1 and 2, the authoring model for software was modelled on that for books: a single person wrote the software, and then it was sold. But since the mid 90s, the open source movement, built on voluntary collaboration, has gathered pace. This is particularly so in the university and research worlds, where the news that you might even be considering writing software to sell is met with disbelief — so that funding bodies routinely now insist that software code be open source. Within the XML world too, another model of programming has also developed: away from the all-inclusive this-application-will-do-it-all to a federated world of individual co-operating programs. This is particularly true in the web world: a simple user request may invoke one program to work out how to respond, which then summons data from a relational database, combines this with other data from an XML database (using XQuery and other X applications), blends into XML, which an XML formatter then transforms to HTML, which the server then passes back to the requester.

This leaves us, then, with a set of directions we can follow for CollateXML:
  1. It will have all the functionality of Collate 2; particularly, it may support interactive user-adjustable collation
  2. It will be written in a modular form, so that (for example) applications which want to use collation services but not to offer interactive adjustment of collation can embed the collation services in their own environment apart from the user interface
  3. It will handle native XML, both with and without a schema or DTD. However, it should employ its own data interface, independent of XML, so that future or other markup languages (including, indeed, the existing Collate markup) could be readily supported by the program. I am known for predicting the demise of XML: an event which will occur when computer science departments recognize that the overlapping hierarchy problem is not a ‘residual’ difficulty, but a fundamental feature of text.
  4. It will be written co-operatively, in an open source environment
  5. The best bet for its development appears to be Java. The range of XML tools already offered by Java gives us an excellent platform — as, too, the remarkable string-processing library Java offers. Combine this with its high modularity, its excellent support for graphic interfaces, and its popularity with XML developers (not least, the eXist world) and we have an extremely compelling case.
So far, the history of Collate.

All this means: the next version of Collate must be open source.

Collate 2, and the design for its successor: CollateXML (now, CollateX)

Historical note: the contexts of this document

The article which follows was posted on the Scholarly Digital Editions blog in a series of entries from February to June 2007.  There were two contexts for these original blog posts:
  1. The age, and impending death, of Collate 2, the computer-assisted scholarly collation program I had started writing in the late 1980s. This had achieved some success, at least by the simple measure that it was one of the rare "humanities computing" (as we used to call it) computer programs used by people other than the person who wrote it. So we used it in the Canterbury Tales project to make some six digital editions; Prue Shaw used it to make her editions of Dante's Monarchia and Commedia (see www.sd-editions.com); the Greek New Testament editing projects at Munster and Birmingham made it the centre of their moves into the digital age; Michael Stolz used it for his edition of Parzival (www.parzival.unibe.ch/home.html). Collate 2 was written for classic Macintosh computers, and the advent of OS X from 2002 on first cast a doubt on the future of the "classic" operating system, and then became a death sentence when Apple announced in 2005 that OS X was moving to Intel processors, and that when it did make this move, the classic system would die. Of course, I could have rewritten Collate for OS X. However, it was clear that this was no simple matter. The heart of Collate was (and is) a series of interactive routines, allowing the scholar to control the collation through multiple dialogue boxes: over a hundred of them, all told. OS X introduced a quite different (and vastly superior) model for handling interactive dialogues; every one of these venerable "event loops" would have to be rewritten.  I could have done this, but by this time, I was aware that there were fundamental things which Collate could not do. Sometimes, you can renovate.  Sometimes, you have to rebuild from the ground up.
  2. As I was mulling this over from 2005 on, I began to talk to two people in particular who were interested in computer scholarly collation systems: Fotis Jannidis, the director of TextGrid, and Joris van Zundert, engaged in software systems development at the Huygens Institute. Both Fotis and Joris had a vital scholarly interest in collation, and both might have access to resources (which might need to be considerable) to write a successor to Collate. In late January 2007, Joris convened a meeting in The Hague to discuss editing software systems, including collation; in early 2008, Fotis came to Birmingham to discuss how we might proceed.
This blog post was directly stimulated, then, by the meeting with Joris and others in 2007. It is an attempt to lay out what I thought might be fundamental to a useful successor to Collate. At the time, I said, half-jokingly, that it took me about five years on my own to write the first version of Collate; it would take ten people ten years to write its successor.  Following these meetings, by various indirections, the InterEdition project (bringing together Huygens and TextGrid people) got started.  A group loosely based within InterEdition took on what quickly became known as CollateX, with Ronald Dekker particularly taking on writing the core software routines.  A look at the CollateX site, some seven years on, suggests that my idle prediction was not too far astray.

Now, the original blog posts, as posted between 12.58 pm GMT, 5 February 2007 and 10.02 pm, 28 June 2007. I follow this with an email I sent to Fotis, Joris and others announcing these posts. Among other matters: the post of June 21 announces the new name: CollateX.  Thus it has been since.


February 5, 2007
The design of CollateXML

Filed under: Designing Collate — Peter @ 12:58 pm

In this document I set out, as clearly as I can, the various data structures and operations which I think CollateXML will require.

The fundamental design of CollateXML is this:

   1. The input is various streams of text, divided into marked collation blocks
   2. These various streams of text are located
   3. Within the streams of text, each corresponding block for collation must be located
   4. The collation program creates two sets of collation information:
         1. concerning the different orderings of the blocks within the streams of text
         2. concerning the differences in the texts contained in the blocks themselves

   5. The collation information is then formatted for output
A few observations:
  1. In Collate0-2, the input was always computer files, held on the computer itself. In CollateXML, the input could be ‘any text, anywhere’: from a database, local or remote; from a URL anywhere.
  2. The crucial marking of collation blocks should be done through something like the ‘universal text identifier’ scheme I outlined at The Hague on 25 January 2007.
  3. Collate0-2 did only ‘word by word’ collation. This presumes that the texts are ‘word by word’ collatable: without very large areas of added, deleted, or transposed text. But many texts have a different kind of relation: large portions of one text might be embedded in another text, but other areas of the texts are very different (the situation common in plagiarism, or ‘intertextuality’, for example). Collate0-2 did not handle this situation; CollateXML should be able to do so.
  4. CollateXML should have its own internal data models for passing information both to and from the collation process. These models should be exposed through an API to programmers, who can then provide import and export for whatever formats they choose.

      We can now begin to specify the building blocks we need.

February 5, 2007
How CollateXML should work
Filed under: How CollateXML should work — Peter @ 1:18 pm

The separation of collation into stages

As Collate0-2 developed, I learnt that one had to break the collation process into stages. At first, Collate 0-2 simply collated, and identified the variants as it found them. I soon learned that the complex requirements of scholarly collation demanded the adjustment of the collation at various points. To do this, it became clear that one had to separate out the stages of collation to permit intervention at various points. However, this separation was grafted onto Collate0-2 in a piecemeal fashion. I propose that from the beginning, CollateXML separate out what appear to me now as the following fundamental stages of collation:
  1. text alignment, one witness at a time against the base
  2. storage of alignment information for all witnesses against the base
  3. adjustment of alignment information for all witnesses against each other
  4. variant identification within the aligned texts.
Text alignment

The fundamental building block is the alignment routine itself. Here is how I suggest this works for word by word collation, based on how it worked in Collate 0-2.
  1. Each alignment act compares two texts of a text block at once: a specified ‘base’ text and a witness text, starting at the first word of the text block in each
  2. The alignment examines the first word of each: if they are identical it returns this information; if there is a variant it returns that information along with the number of words matched in base text and witness text. The possibilities are:
    1. same word in each. 1 word matched in each; next word to match in base will be word 2; next word to match in witness will be word 2
    2. one word replacing one word in each. next word to match in base will be word 2; next word to match in witness will be word 2
    3. word omitted in witness. next word to match in base will be 2; next word to match in witness will be 1
    4. word added in witness. next word to match in base will be 1; next word to match in witness will be 2
    5. phrase omission or addition: as for 3 and 4, but the next word to be matched in base or witness will be adjusted accordingly
    6. phrase replacement: if two words in the base are replaced by three words in the witness: then the next word in the base to be collated for this witness will be word 3; the next word to be collated in the witness will be word 4.
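The bookkeeping of these six cases can be pictured as a small record of how many words each alignment act consumes on each side. An illustrative Python sketch; the record type and its names are mine, not Collate's:

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    """One alignment act: how many words were consumed in the base
    and in the witness, and whether the act records a variant.
    Illustrative sketch only, not Collate's own data structure."""
    base_words: int      # words consumed in the base
    witness_words: int   # words consumed in the witness
    variant: bool        # False for an exact match

# case 1: same word in each side, so one word consumed in each, no variant
same = Alignment(base_words=1, witness_words=1, variant=False)
# case 6: two base words replaced by three witness words, so the next
# word to collate is base word 3 and witness word 4
phrase = Alignment(base_words=2, witness_words=3, variant=True)
```

The point of the record is that the next word to collate on each side is simply the running total of words consumed so far, plus one.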
This is, fundamentally, rather simple. You can look over the C code for Collate 2 to see how we did this. Essentially, at each alignment, Collate 2 carried out a series of tests, till it got a match:
  1. are the next words identical?
  2. are the next words a variant, and so align against each other? Collate used a ‘fuzzy match’ algorithm for this. Essentially, if the two words had more than 50% of their letters in common (weighted according to the position of the letters) then Collate said, these words align. Thus, Collate would see ‘cat’ and ‘mat’ as variants on each other
  3. it could be that while this word does not match, the next word does. So Collate will look at ‘black cat’ and ‘white cat’ and declare that ‘black’ and ‘white’ align, because the next word is a match. Indeed, Collate would look at ‘black cat’ and ‘white mat’, see that mat/cat align because they satisfy the fuzzy match test, and so declare black/white align
  4. If there is still no match: Collate tests the second word in the base against the first word in the witness. If they match: Collate concludes that the first word in the base is omitted.
  5. Still no match: Collate tests the second word in the witness against the first word in the base. If they match: Collate concludes that the first word in the witness has been added. Now, here is an important point: after establishing that the first word in the witness has been added, Collate goes around again to collate the SECOND word in the witness against the first word of the base, and reports a SECOND variant at this point. For example: if the base has ‘mat’ and the witness has ‘black cat’ Collate could report that ‘black’ has been added, and that ‘cat’ is a variant on ‘mat’. See further below on additions and omissions.
  6. Still no match: Collate guesses that maybe the problem is word division. So it concatenates words in the base and the witness, comparing as it does, to see if it can find a match
  7. Still no match: Collate starts searching for phrase variants: addition/omission/replacement. In essence, it looks further along the text, seeking to find sequences that match, with everything up to the match a replacement/addition/omission. This is probably the least sophisticated part of the current Collate. Collate also has a limit of 50 words for its look-up: this might be lifted.
  8. After Collate has found the match on this first word: it then looks to check if the NEXT match between this witness and the base is an addition in this witness. The reason it needs to do this is explained in the section on additions and omissions below.
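The fuzzy match of step 2 can be sketched as follows. This is an illustrative Python reconstruction, not Collate's C code; the exact position weighting used here is an assumption, with only the "more than 50% of letters in common, weighted by position" rule taken from the text:

```python
def fuzzy_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Sketch of Collate's 'fuzzy match' test: two words align if
    they share more than ~50% of their letters, with letters in the
    same position weighted more heavily. The precise weights (1.0
    in position, 0.5 elsewhere) are my assumption, not Collate's."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return False
    score = 0.0
    for i, ch in enumerate(a):
        if i < len(b) and b[i] == ch:
            score += 1.0        # same letter in the same position
        elif ch in b:
            score += 0.5        # letter present, but elsewhere
    return score / max(len(a), len(b)) > threshold

# 'cat' and 'mat' share two of three letters in position, so they align;
# 'black' and 'white' share none, so they only align via the
# next-word look-ahead of step 3.
```

Under this sketch, `fuzzy_match("cat", "mat")` holds but `fuzzy_match("black", "white")` does not, matching the behaviour described above.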
After Collate has done this for this word in the base against this witness: it goes on to do the same for the next witness against this same base. As it identifies each alignment, it stores the alignment information for each witness. When it has worked its way through all the witnesses for this word, and has stored the alignment for each: it proceeds through the next stages of adjusting the alignment and finally identifying the actual variants.

After completing, as described, the alignment for the first word of the base against all the witnesses: Collate now goes on to align the second word of the base. Notice particularly what happens when Collate discovers that it has already matched past the next word in the witness: when, say, the first six words of the base have been replaced by the first eight words of the witness. In that case, Collate will skip over that witness until it is collating word 7 of the base: it will then restart the collation by collating word 9 of the witness against that word.

Alignment is NOT variant identification

I have spoken so far only of text alignment, not variant identification. The difference is important. Here is an example:

Base the black cat
A The black cat
C The black cat
D The, black cat

For the purposes of text alignment: the collation algorithm should ignore case differences, punctuation tokens, and XML encoding around or within the words. Thus: it should identify the first word of each witness as aligned against the first word of the base. But note: it might be desirable to identify any of these first words as having a variant at this point. This variant identification is to be done at a later point. For now, all we have to do is state that the first word in each witness aligns against the first word of the base.
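A minimal sketch of this separation: reduce each word to a normalized token for alignment, while keeping the original form for later variant identification. Illustrative Python, not Collate's own routine:

```python
import re

def alignment_token(word: str) -> str:
    """Reduce a word to the form used for alignment only: strip any
    XML encoding, punctuation and case. Variant identification, which
    may still want to report 'The,' against 'the', is done later on
    the original forms. Illustrative sketch, not Collate's own code."""
    word = re.sub(r"<[^>]+>", "", word)   # drop XML encoding
    word = re.sub(r"[^\w]", "", word)     # drop punctuation tokens
    return word.lower()                   # ignore case

# 'The,' in witness D aligns with 'the' in the base, because both
# reduce to the same token; the original spellings are retained for
# the later variant-identification stage.
```

This keeps the alignment stage simple while leaving all the information needed for the later stages intact.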

Additions and omissions in Collate
This is a particularly difficult area. I discovered in the course of writing Collate0-2 that different people want all kinds of different things. Some people do not want to see additions and omissions at all, but only replacements of shorter or longer phrases by longer or shorter ones. When it is an addition and the scholar wants this seen as a replacement of a shorter phrase by a longer one, some scholars want to see the addition attached at the beginning, some at the end. Take this text:

Base: a cat
witness: a black cat

Here are the possibilities:

a ] a; black added
cat ] cat (writing the addition with the PRECEDING word)


a ] a
cat] cat; black added (writing the addition with the FOLLOWING word)


a ] a black
cat ] cat (as phrase, addition with preceding word)


a ] a
cat ] black cat (as phrase, addition with following)


a ] a
… ] black
cat ] cat (this is actually the system used in Munster!)

Collate0-2 supports all these possibilities, but does it in a rather inelegant way. Essentially, it tries to adjust for these possibilities WHILE it collates. This is complex and inflexible. Instead, I propose that CollateXML separate completely the discovery of alignment, its storage and its expression. Collate0-2 almost does this, but does not do it thoroughly. Broadly, I began with outputting the variants as the program discovered them. Increasingly I found that one needed to adjust the variation in various ways, and so moved towards separation of the stages of alignment discovery, storage and variant identification. A major benefit of this separation is that it permits adjustment of the variation at the storage point: see next. However, Collate0-2 never quite managed a complete movement to this separation. I propose that CollateXML has this separation at its heart.

The storage of alignment information
So far, I have been describing how Collate discovers alignment. Essentially, Collate discovers, for each word in the base text, for a particular witness, exactly what alignment is present in a given witness at that point.
The possibilities are:
  1. base and witness align at this word, either because there is no variation (base and witness agree on this word) or because base and witness vary at this word (either as: omission of this word, or variation on the word)
  2. base and witness align and there is an addition before and/or after this word (note: this includes the possibility of an addition before the word, omission of the word, and addition after the word)
  3. the word is the beginning of a phrase alignment, with or without an addition before the phrase alignment (note: this includes the possibility of phrase omission)
  4. the word is the ending of a phrase alignment, with or without an addition after the phrase alignment (including, the possibility of phrase omission)
  5. the word falls within a phrase alignment (for example: ‘the black cat’ replaced by ‘a white mouse’). When Collate comes to collate ‘black’ in the base against this witness, it will find that it falls within a phrase alignment and move on to the next word.
It can be seen that alignment is much complicated by the need to deal with ‘additions’. Again, Collate0-2 never quite dealt with this as well as it needs to, and again I propose to remedy this in CollateXML. I suggest that they be dealt with as follows:
  1. The base text is seen as a series of slots, corresponding to the words AND the space before the first word, between each word, and after the last word
  2. The variants in each witness be aligned against these slots. Thus: an addition before the first word is aligned against the slot before the first word; a variant at the first word is aligned against the first word; an addition between the first and second words is aligned against the slot between these words, and so on.
One may illustrate this with the base ‘black cat’ collated against the witness ‘the black and white cat’. Numbering each ’slot’ in the base, we have:

slot 1 (space before ‘black’) ] the
slot 2 ‘black’                ] black
slot 3 (space)                ] and white
slot 4 ‘cat’                  ] cat
slot 5 (space after ‘cat’)    ]
Thus: the additions ‘the’ and ‘and white’ align against slots 1 and 3. In this system, even numbers are used for words; odd numbers for the spaces between the words. (I am indebted to the Institute for New Testament Research, Munster, for this numbering system, and this conception of the base as a series of slots for both words and the spaces between them).
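The slot scheme can be sketched as follows. This is an illustrative Python reconstruction, not the Munster institute's own code; it starts the first space at slot 1, which is what places the additions ‘the’ and ‘and white’ on slots 1 and 3 as described:

```python
def slot_map(base_words):
    """Number the base as alternating space/word slots: odd numbers
    for the spaces (before, between and after the words), even
    numbers for the words themselves. Sketch of the Munster-style
    numbering described above; None marks an empty space slot."""
    slots = {}
    for i, w in enumerate(base_words):
        slots[2 * i + 1] = None        # space before word i
        slots[2 * i + 2] = w           # the word itself
    slots[2 * len(base_words) + 1] = None  # space after the last word
    return slots

# Base 'black cat': 'black' is slot 2, 'cat' is slot 4; the additions
# 'the' and 'and white' hang on the space slots 1 and 3.
slots = slot_map(["black", "cat"])
```

Each witness variant is then stored against one of these slots, whether it falls on a word or in the space between words.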

The adjustment of variant information: a relatively simple case

The principal benefit of the separation of the stages of alignment discovery, storage and output is that it permits adjustment of the variant alignments at the storage stage and before the output.

Consider the case of the following (rather fictitious) instance:

Base:      The cat sat on the mat
Witness: The black sat on the mat

Left to itself, Collate0-2 will tell us that the variant is:

cat ] black

But in fact, what has happened here is more correctly:

The ] The
.. ] black
cat ] omitted

That is: first ‘black’ is added, then somehow ‘cat’ is omitted.

We may have many witnesses which read ‘The black cat’ where the base reads ‘The cat’. In this case, at the storage stage, we should expect Collate to look over the variants discovered in the other witnesses, find that in many others we have ‘black’ added, and it should then adjust the stored variant information so that instead of reading:

slot 2 ‘The’   ] The
slot 3 (space) ]
slot 4 ‘cat’   ] black
the stored representation reads:

slot 2 ‘The’   ] The
slot 3 (space) ] black
slot 4 ‘cat’   ] omitted
That is: with ‘black’ matching against the space between ‘the’ and ‘cat’ (as it is in other witnesses) rather than against ‘cat’.

Collate0-2 did NOT do this variant adjustment. CollateXML should do it. In the next section, I consider some possibilities.
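A sketch of how this storage-stage adjustment might work. Everything here is hypothetical illustration of the proposal, not anything Collate0-2 contains: the data shape (a map from base slot number to stored reading) and the function are mine:

```python
def adjust_alignment(stored, other_witnesses_slots):
    """Storage-stage adjustment as proposed above (Collate0-2 did NOT
    do this). 'stored' maps base slot -> reading for one witness.
    If a reading stored against a base word also occurs as an
    addition in the preceding space slot in other witnesses, move it
    to that space slot and mark the base word omitted (None).
    Hypothetical sketch, not Collate's API."""
    adjusted = dict(stored)
    for slot, reading in stored.items():
        space_before = slot - 1
        if reading and reading in other_witnesses_slots.get(space_before, set()):
            adjusted[space_before] = reading   # 'black' moves to the space slot
            adjusted[slot] = None              # 'cat' is now omitted
    return adjusted

# Base 'The cat': 'The' is slot 2, 'cat' is slot 4. This witness reads
# 'The black'; many other witnesses read 'The black cat', i.e. they
# store 'black' as an addition in space slot 3.
result = adjust_alignment({2: "The", 4: "black"}, {3: {"black"}})
```

After adjustment, ‘black’ sits on slot 3 (as in the other witnesses) and ‘cat’ is recorded as omitted, which is exactly the corrected stored representation described above.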

The adjustment of alignment information: towards multiple progressive alignment

This matter of automated adjustment of variant information at the storage stage — that is, after the collation of a particular word has finished — is one area where the algorithms of Collate0-2 could be dramatically improved.

Consider, first, this case:

base        the white and black cat
witness1 the black and white cat

Collate0-2 will record this as a single piece of variant information: that the whole phrase ‘white and black’ in the base has been replaced by the whole phrase ‘black and white’. It has been pointed out to me, quite separately, by two very different groups of scholars, that this is inadequate (the two groups are: the Münster institute, and the department of Molecular Biology in Cambridge):
  1. This does not record that the words of the variant text ‘black and white’ are actually the same as those of the base
  2. As a result: suppose that a second witness has ‘green or blue’ for this phrase. To the program (and hence, to any system based on it) there is exactly as much difference between the variants ‘black and white’ and ‘green or blue’ as there is between each variant and the base text ‘white and black’. But this loses a key piece of information: that the variant ‘black and white’ is actually much closer to the base than is the variant ‘green or blue’.
CollateXML needs to find a way of adjusting the variant store to show that in fact the variant ‘black and white’ represents not one, but four pieces of information:
  1. firstly, that there is a phrase variant (the existing Collate0-2 algorithms do this)
  2. secondly, that actually each word in the phrase variant does agree with the base: a further three pieces of information (Collate0-2 goes a little way towards this, but not far enough)
Consider, further, this case:

base the white and black cat
witness1 the black and white cat
witness2 the black and green cat

Here, we should show that witness2 both has a phrase variant AND is a witness for the words ‘black’ and ‘and’ — and, furthermore, has a variant ‘green’ on the word ‘white’ in both the base and witness1. One wants an output as follows:

white and black ] black and white witness1; black and green witness2
white ] witness1; green witness2
and ] witness1 witness2
black ] witness1 witness2

If we can figure out a way to store this information then we are well on our way to collation nirvana: multiple progressive alignment. But before we get to that place: we have to understand parallel segmentation.

Variant information storage and parallel segmentation

Perhaps the single most important development in Collate2 was the support for parallel segmentation. I write about this in the ‘Collation rationale’ on the Miller’s Tale CD-ROM. The example I use there is

This Carpenter hadde wedded newe a wyf
This Carpenter hadde wedded a newe wyf
This Carpenter hadde newe wedded a wyf
This Carpenter hadde wedded newly a wyf
This Carpenter hadde E wedded newe a wyf
This Carpenter hadde newli wedded a wyf
This Carpenter hadde wedded a wyf

In that article I explained that in the early versions of Collate, we used to collate this by what I called ‘base text collation’: that is, we would compare each witness (54 in this case) word by word with this one base, one witness at a time, and output the variation so:

This ] 54 witnesses
Carpenter ] 54 witnesses
hadde ] 54 witnesses
wedded ] 53 witnesses; E wedded 1 witness
wedded newe ] newe wedded 1 witness, newli wedded 1 witness
newe ] 26 witnesses; newly 1 witness; omitted 1 witness
newe a ] a newe 23 witnesses
a ] 30 witnesses
wyf ] 54 witnesses

We see here that for the first three words and the last word there is no variation, and we just state accordingly that all witnesses there agree with the base and with each other. All the variation occurs on the three base text words ‘wedded newe a’. This variation is actually recorded against five lemmata: in turn ‘wedded’, ‘wedded newe’, ‘newe’, ‘newe a’ and ‘a’. Observe that the phrases ‘wedded newe’ and ‘newe a’ both overlap each other, and also overlap the three words ‘wedded’, ‘newe’ and ‘a’.

The ‘Collation rationale’ article goes on to explain why we became increasingly dissatisfied with this method. One factor was that it highlighted the base text: by referring all variation to this base text, it gave the base text a prominence which we did not think appropriate. We thought of the base text as just a series of slots on which we hung the collation: but this mode of expression seemed to give it an authority beyond this. It is not that we do not believe in ‘edited’ texts: just that this base text was not conceived, or intended to be, any such edited text. But its prominence made it look as if it could be such an edited text.

A second factor was the argument put to us by the evolutionary biologists: that where variant lemmata overlap, as they do in the cases of the five variants on the three words ‘wedded newe a’, one cannot compare directly the different witnesses. Here, we have one set of variants on the phrase ‘wedded newe’ and a second on the phrase ‘newe a’, as well as variants on each individual word. If manuscript A has a variant on ‘wedded newe’ and B has one on ‘newe a’ there is no way one can compare the text of A and B directly, and make any statement at all about the relationship between A and B at those points.

This defect in base text collation had other implications. We wanted to be able to point at any word in any manuscript and say: what readings do the other manuscripts have at this point? But this was exactly what our system could not do. With our system, we could only say: at this word, the base text has such and such. We could not always say: at this word, here are all the readings found at this point in all the other texts. Similarly, we wanted to be able to compare any two (or more) manuscripts word by word, showing exactly how they differ. Once more, this system could not do that: we could only show how they severally differed from the base text, not how they differed from each other.

The only cure for this we could see was: eliminate overlapping variation. This meant that we should refer all variants in all witnesses to the same base lemma. In practice, this meant that the unit of variation had to be fixed by the longest variant present at any point. In the case of the Miller’s Tale example: with base text collation we have five sets of lemmata in the three-word base sequence ‘wedded newe a’, and so cannot compare the witnesses on any one of the lemmata with those for any other. To eliminate all overlapping variation here we should have one lemma and one lemma only: all three words of the base text here. All variants on this one lemma are then directly in parallel with each other. The whole text, across all the witnesses, is broken into parallel segments, with the text of any one witness at any one segment being directly comparable to the text of any other witness at that segment: hence the name ‘parallel segmentation’.

This is the collation given by the base text collation system, with five different lemmata:

wedded ] 53 witnesses; E wedded 1 witness
wedded newe ] newe wedded 1 witness, newli wedded 1 witness
newe ] 26 witnesses; newly 1 witness; omitted 1 witness
newe a ] a newe 23 witnesses
a ] 30 witnesses

Now, this is the collation given by parallel segmentation, with just one lemma:

wedded newe a ] wedded newe a 25 witnesses
wedded a newe 23 witnesses
newe wedded a 1 witness
E wedded newe a 1 witness
wedded newly a 2 witnesses
newli wedded a 1 witness
wedded a 1 witness

How did Collate0-2 identify parallel segments? Collate0-2 used a system of variant information storage similar to that outlined above: essentially, creating a table recording in numeric form exactly which words in each witness correspond with which words in the base. It would update this table after each word collated in the base. Then, it would inspect the table, and ask: is there a variant lemma open at this point? If there were, it would not output any apparatus, but move on to the next word; only when it found no variant lemmata open would it output all the variants on the whole segment of text.

Thus, for the base text sequence ‘wedded newe a’ it would proceed as follows. It would collate the first word, ‘wedded’, and discover the following:

wedded ] 53 witnesses; E wedded 1 witness
wedded newe ] newe wedded 1 witness, newli wedded 1 witness

That is, the lemma ‘wedded newe’ is still open after collation of ‘wedded’. So no apparatus is output, and it goes on to the next word:

newe ] 26 witnesses; newly 1 witness; omitted 1 witness
newe a ] a newe 23 witnesses

Now, the lemma ‘wedded newe’ has been closed. But another variant lemma ‘newe a’ is now open. So we have to carry on to the next word:

a ] 30 witnesses

Now, at last: no phrase variant is open. We can close the segment, and output all the variation found on the whole phrase ‘wedded newe a’.
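The procedure just described can be sketched in a few lines of Python. This is an illustration only: the function name, the span representation, and the bookkeeping are invented for the sketch, not Collate's actual interface.

```python
def parallel_segments(n_base_words, lemma_spans):
    """Group base-word positions into parallel segments.

    lemma_spans: (start, end) pairs, each the inclusive span of base
    words covered by one variant lemma. A segment closes only when no
    lemma is still 'open', i.e. none extends beyond the current word.
    """
    segments = []
    seg_start = 0
    open_until = 0  # furthest base word reached by any open lemma
    for i in range(n_base_words):
        for start, end in lemma_spans:
            if start <= i <= end:
                open_until = max(open_until, end)
        if i >= open_until:  # nothing open: close the segment here
            segments.append((seg_start, i))
            seg_start = i + 1
            open_until = i + 1
    return segments

# Base 'This Carpenter hadde wedded newe a wyf' (words 0..6), with the
# Miller's Tale lemmata: 'wedded' (3), 'wedded newe' (3-4), 'newe' (4),
# 'newe a' (4-5), 'a' (5), and single-word lemmata elsewhere.
spans = [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4), (4, 4), (4, 5), (5, 5), (6, 6)]
print(parallel_segments(7, spans))
# -> [(0, 0), (1, 1), (2, 2), (3, 5), (6, 6)]: 'wedded newe a' is one segment
```

The overlapping lemmata on words 3 to 5 keep the segment open until word 5, exactly as in the walkthrough above.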

The limits of parallel segmentation: toward progressive multiple alignment

Parallel segmentation has served us well. It has allowed us to remove the base text from the apparatus output completely: on our publications now, you do not see the base text at all. We still use a base text when we collate, but its function now is purely to identify the variants present at each point, and we customarily optimize it for that purpose (for example, adding or rearranging words to improve variant identification). The move to parallel segmentation has other benefits. We can now identify at any point in any witness just what witnesses are present at that point; we can compare any two (or more) witnesses; we can create much richer analyses of stemmatic relations. But we are still not happy.

In the ‘Collation rationale’ article I cite the variants on the first four words of line 646 of the Miller’s Tale (‘He was agast so of Nowelys flood’).
  1. He was agast so 33 witnesses
  2. He was agast 4 witnesses
  3. So he was agast 6 witnesses
  4. He was so agast 7 witnesses
  5. He was agast and feerd 2 witnesses
  6. So was he agast 1 witness
Just as a presentation of the variation at this point, this is quite efficient. But as a representation of the exact linkages between the witnesses, it is rather deficient. These six variants are presented in simple parallel, as if no two of them are any closer than any other. But manifestly, that is not true. The second and fourth readings, ‘He was agast’ and ‘He was so agast’, are much closer to the first reading ‘He was agast so’ than they are to either the third or sixth readings. In turn, the third and sixth readings, ‘So he was agast’ and ‘So was he agast’, are much nearer each other than they are to the other readings.

With parallel segmentation, the collation stops once it has found the segments, and simply presents them. In this collation system, all variants at any point are equally unlike one another. We require some system of grouping the variants within each segment. For this example, I proposed that the six variants here should be grouped into two variant sequences:
  1. 46 witnesses: made up of variants 1, 2, 4 and 5, all beginning with the words ‘He was…’
  2. 7 witnesses: made up of variants 3 and 6, both beginning with ‘So’
We can break up the first group still further:
  1. 40 witnesses, made up of variants 1 and 4, having the same words but with ‘so agast’ transposed
  2. 6 witnesses, made up of variants 2 and 5, both omitting ‘so’
Finally, we note that the two groups 1 and 2 (of 46 and 7 witnesses) are linked together via variant 1 (from group 1) and variant 3 (from group 2): these differ only in their placement of the word ‘so’. We can represent this schematically as follows:

From examination of the variant map, we can see that — rather remarkably (or not!) — this representation mirrors the textual history of the tradition. The original reading is likely to have been variant 1 (33 witnesses). Three variants descended directly from variant 1:

Variant 3 (6 witnesses), by transposition of ‘so’ to the head of the line; from this a further variant (variant 6, 1 witness) develops, by transposition of ‘he was’
Variant 4 (7 witnesses), by transposition of ‘agast so’
The ancestor of variants 2 and 5: both omit the ‘so’, while 5 adds ‘and feerd’.

Indeed, this distribution is consistent with other groupings established by our analysis.
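As a crude first illustration of the kind of grouping wanted here, one might begin by clustering the variants on their opening word. This is a toy heuristic in Python, invented for illustration, not the proposal itself:

```python
from collections import defaultdict

# The six variants on 'He was agast so', with their witness counts.
variants = {
    'He was agast so': 33,
    'He was agast': 4,
    'So he was agast': 6,
    'He was so agast': 7,
    'He was agast and feerd': 2,
    'So was he agast': 1,
}

# Cluster variants by their opening word: a first approximation to the
# two variant sequences described above.
groups = defaultdict(int)
for reading, count in variants.items():
    groups[reading.split()[0]] += count

print(dict(groups))  # {'He': 46, 'So': 7}
```

This recovers the 46/7 split described above, but of course says nothing about the finer subgroupings within each sequence, which need something closer to genuine progressive alignment.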

So, here is the challenge I set in the ‘Collation Rationale’ article, here set out in more detail:
  1. Identify relationships between the variant groups found by parallel segmentation
  2. Work out a way of storing the information about these relationships, so as to enable different kinds of output
  3. Work out the best ways of expressing this information, in some kind of hierarchical or layered form.
At present, we do not have any means of formally expressing the relationships between the variant groups found by parallel segmentation. Here is a draft of how it might be done, using the example above:
He was agast so
    He was so agast
    He was agast
        He was agast and feerd
    So he was agast
        So was he agast

We need an adaptation of the system used by the TEI to hold this. Ideas please!
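One possible starting point: the TEI critical apparatus markup already allows reading groups to nest, via <rdgGrp> elements inside <app>. The layered grouping above might then be sketched something like this (witness sigla and counts omitted; an illustration only, not a worked-out proposal):

```xml
<app>
  <rdgGrp><!-- group 1: 'He was...', 46 witnesses -->
    <rdgGrp><!-- variants 1 and 4: 'so agast' transposed -->
      <rdg>He was agast so</rdg>
      <rdg>He was so agast</rdg>
    </rdgGrp>
    <rdgGrp><!-- variants 2 and 5: 'so' omitted -->
      <rdg>He was agast</rdg>
      <rdg>He was agast and feerd</rdg>
    </rdgGrp>
  </rdgGrp>
  <rdgGrp><!-- group 2: 'So...', 7 witnesses -->
    <rdg>So he was agast</rdg>
    <rdg>So was he agast</rdg>
  </rdgGrp>
</app>
```

Whether nesting alone can also express the link between the two groups (variant 1 to variant 3, differing only in the placement of ‘so’) is exactly the open question.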

Variant identification

So far, we have aligned the texts, stored the alignment information, and then adjusted the alignment (we hope, through some form of progressive multiple alignment). But we have not yet identified any variants. Now, consider again our example from above:

A The black cat
C The black cat
D The, black cat

Following parallel segmentation, we may now ignore the base. We look at the first word in each witness. Suppose four witnesses whose first words differ only in case, XML encoding, and punctuation:

A The
B THE
C <hi>The</hi>
D The,

Are these, or are these not, variants of each other? I propose that Collate3 have, for each witness, a specifications object. This will state, for each witness, whether differences of case, XML encoding, and punctuation are to be treated as variants or not. Presume that we direct that case differences and XML encoding are not variants, but that punctuation is. We would get the following collation, taking A as the base:

The ] A B C; The, D

Taking B as the base: the variant would appear as
THE ] A B C; The, D

Or, if we say that punctuation is not significant, but XML encoding is significant, we will get this collation:

The ] A B D; The C

Variant identification and the return of the base

I said above that we may now discard the base. Up to a point, Lord Copper (esoteric joke: see Evelyn Waugh’s Scoop). There is one critical operation for which we still must retain the base.
The question has to do with the use of variant specifications to identify exactly what is a variant. Suppose for our pair A and D we have the variants THE (A) and The, (D). We have the following variant specifications:

A: ignore case and punctuation
D: ignore case but do not ignore punctuation

We now compare A and D: ‘THE’ and ‘The,’. From the point of view of A, there is no variant here, because we are ignoring both case and punctuation. But from the point of view of D, there is a variant, because there is a difference of punctuation.

Thus: the variation found changes, according to the point of view. It changes too (obviously) according to which witnesses we are comparing. The only way I can see out of this is to use the base as the measure against which variants are identified, but always do the variant identification using the specifications set for the witness. In this case, presume that the base here is ‘The’, with all witnesses set to ignore case but not punctuation. We will then have:

The ] A B C; The, D

Notice that depending on the base text and the collation specifications, we could get very different results. Suppose that we set punctuation to be ignored in A B C but not D. If we use ‘The’ as the base text, we get this:

The ] A B C; The, D

But if we set ‘The,’ as the base, we get this:

The, ] A B C D

I don’t see any other way around this. One could avoid the problem entirely (as Collate0-2 did) by insisting that all witnesses have the same collation specifications. But it has been forcefully represented to me that it would be very useful to be able to specify different treatments of case/punctuation/XML for different witnesses. So we will do this.
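The proposal (compare each witness against the base, under that witness's own specifications) might look like this in Python. The names and the two-flag specification are invented for the sketch; the fuller specifications object is described in the next post.

```python
import string

def normalize(token, ignore_case, ignore_punct):
    """Reduce a token according to one witness's collation specifications."""
    if ignore_punct:
        token = token.translate(str.maketrans('', '', string.punctuation))
    if ignore_case:
        token = token.lower()
    return token

# Per-witness specifications: (ignore_case, ignore_punctuation).
specs = {'A': (True, True), 'D': (True, False)}
readings = {'A': 'THE', 'D': 'The,'}

def is_variant(witness, base_token):
    """Compare a witness reading against the base, using the
    specifications set for that witness."""
    ic, ip = specs[witness]
    return normalize(readings[witness], ic, ip) != normalize(base_token, ic, ip)

# Measured against the base 'The': A agrees (its specifications ignore
# both case and punctuation), D varies (its specifications keep punctuation).
print(is_variant('A', 'The'))  # False
print(is_variant('D', 'The'))  # True
```

Anchoring every comparison to the base in this way gives a single consistent answer, instead of an answer that flips according to which witness's point of view we adopt.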

February 6, 2007
Datastructures for CollateXML

Filed under: CollateXML datastructures — Peter @ 5:46 am

From the account of the collation, we are dealing with something very different from ‘string comparison’. Indeed, the base unit of the collation is the word: we collate words, not strings. Words may be concatenated, or divided: but words are the basis of it all. (This was the form used by Collate).

For each witness, we need the following information:
  1. Its sigil
  2. Its location (in Collate0-2 this was simply a file name; in CollateXML it might be a URL, an XQuery or XPath expression, etc.)
  3. Collation specifications for this witness. See below.
  4. For each collateable block: two collateable object arrays. See below
  5. For each collateable block: an array of correspondences with the base. See below.
The collation specifications for variant identification

These will control what is recorded as a variant against the base. Settings include:

a. case. Settings will be: collate/ignore.

if collate: Collation will treat differences of case as variants.
if ignore: Collation will not treat differences of case as variants.

b. xml. Settings will be: all/none/nominated
If none: all xml encoding surrounding, within or between the words will be ignored
If all: all xml encoding will be collated, including empty elements, surrounding, within, and between the words
If nominated: only nominated xml elements will be collated. The details of the xml elements to be collated will be held in a further structure (see below).

c. xmlcollate: null unless xml=nominated. This structure is a series of elements to be collated, as follows:

i. gi: the gi of the element to be collated (including namespace)
ii. attributes: Values are all/none/nominated. If all: all attributes and their values are to be collated; if none, all attribute values are ignored, and only element names are collated; if nominated, details of attributes to be collated are held in a further structure
iii. collateattributes: null unless attributes=nominated. This structure is a series of attribute names which will be collated for this element (this could be further elaborated, perhaps, to set conditions: report as variant if the attribute is a particular value)

d. punctuation. Settings will be: all/none/nominated
if all: collate all punctuation, as identified by the isPunctuation method
if none: collate no punctuation, as identified by the isPunctuation method
if nominated: collate only the nominated punctuation, as identified by the isPunctuation method

The specifications object must also have at least one method: isPunctuation. For a particular pair of strings, this should identify whether differences between them are purely punctuation (in which case, they might or might not be variants) or not.
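A minimal sketch of how such a method might behave, in Python. It assumes that Unicode punctuation categories (P*) are an adequate test; the real method would presumably also have to respect the nominated-punctuation settings above.

```python
import unicodedata

def is_punctuation_difference(s1, s2):
    """Return True if s1 and s2 differ, but only in punctuation:
    i.e. they become identical once punctuation characters
    (Unicode general category P*) are stripped from both."""
    def strip_punct(s):
        return ''.join(ch for ch in s
                       if not unicodedata.category(ch).startswith('P'))
    return s1 != s2 and strip_punct(s1) == strip_punct(s2)

print(is_punctuation_difference('The', 'The,'))  # True
print(is_punctuation_difference('The', 'Thee'))  # False
```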

Two other methods might be required:

isCaseDifference: if it is found that Java’s native methods for ignoring case difference when comparing strings are not adequate.
adjustXML: for some contexts, we may need to do more than simply ignore or not ignore XML.

One might here wish to expand the &per; entity in a word such as ‘ex&per;ience’ and treat this as ‘experience’.

The collation specifications for text alignment

The model here proposed, of separating text alignment from variant identification, presumes that optimal text alignment would be achieved by ignoring differences of case, punctuation and xml. Thus, at the alignment stage, we would use the minimal set of collation specifications for comparison of witnesses with the base.

Hierarchical setting of collation specifications

One would expect that for most collations, one would have identical specifications for all witnesses. In programming terms: one would set the specifications for the class of witnesses, from which each witness would then inherit a uniform set of specifications. This design permits the uniform specification to be overridden for specific witnesses.

The collateable object arrays

The key to Collate0-2 was that it did not collate text strings: it collated word objects. For each witness, it held the words of the text in an array of word objects, numbered from 0 to xxx, and all collation took place against these word objects, with information about variants found stored in tables of numbers referring to these arrays. I propose that CollateXML retain, refine and extend this model.
Collate0-2 accepted ‘plain text’ and converted this to word object arrays as it collated. As it did so, it might remove (depending on various settings) punctuation or other characters from the text to be collated. Thus ‘April / that’ would become:

word 1: April
word 2: that

Notice that the ‘/’ is here removed. At a later point, Collate0-2 converted the text to
<w n="1">April</w> / <w n="2">that</w>

This is rather unsatisfactory. The relationship between the numbering of the words in the word object array and that in the converted XML depends on rather fragile assumptions about what is and is not a word. I propose instead that CollateXML require that, for word-by-word collation, input be in full XML form, with all discrete collateable elements marked as follows:

<w n="1">April</w> <w n="2">/</w> <w n="3">that</w>

This has several implications. It means that, because of the problem of overlapping hierarchies, elements spanning across words have to be treated as follows:

<w n="1"><hi>April</hi></w> <w n="2"><hi>/</hi></w> <w n="3"><hi>th</hi>at</w>

rather than as the overlapping form:

<hi><w n="1">April</w> <w n="2">/</w> <w n="3">th</hi>at</w>

The advantage of the explicit labelling of every collateable object in the original text as a <w> element with an ‘n’ attribute is that it makes linking of the collation with the original text absolutely explicit. The ‘n’ attribute on each <w> element can be used to denote each word in the collateable object array, and then used to link to the corresponding <w> element in the original. (One might — might — use XPath to achieve the same result: that is a matter for discussion.)
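The linking this buys can be sketched with Python's standard ElementTree. The sample line is simplified (straight quotes, no entity references), and the dictionary shape is invented for the sketch:

```python
import xml.etree.ElementTree as ET

line = ('<l id="MI-35-Ii" n="35">'
        '<w n="1">This</w> <w n="2">Carpenter</w> <w n="3">hadde</w> '
        '<w n="4">wedded</w> <w n="5">a</w> <w n="6">newe</w> '
        '<w n="7">wyf</w></l>')

# Build the collateable object array: each entry keeps the word's text
# (itertext() also gathers text inside nested elements such as <hi>)
# and the n attribute that links it back to the <w> in the original.
words = [{'n': w.get('n'), 'text': ''.join(w.itertext())}
         for w in ET.fromstring(line).iter('w')]

print([w['text'] for w in words])
# ['This', 'Carpenter', 'hadde', 'wedded', 'a', 'newe', 'wyf']
print(words[3]['n'])  # '4': the explicit link back to <w n="4"> in the source
```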

I said we need TWO collateable object arrays for each witness. The first array, as specified above, holds the original text: call this textOriginal. But this is not the text which will actually be collated. The second array holds the text which will actually be collated: call this textCollateable.

TextCollateable will have identical structure and initially identical content to textOriginal.

The reason for the two arrays is to make regularization possible. Regularization was one of the great strengths of Collate0-2, and the approach here suggested is based closely on how Collate0-2 worked. As the scholar collates, he or she will see cases where it is necessary to filter out spelling or other non-significant variation. This may involve alteration of word division. Thus, we might be collating:

base: the man Cat
wit1: theman cat

It appears that in wit1 one will want to change the word division for ‘theman’ and regularize ‘cat’ to ‘Cat’. Thus, textOriginal would hold for wit1:

word1 theman
word2 cat

while textCollateable must be altered to:

word1 the
word2 man
word3 Cat

Notice that this will mean keeping an offset pointer at each word, indicating for each array what is the corresponding word in the other array.

Putting this together, we require the following information for each word object in each collateable object array:

  1. the word itself (including XML encoding)
  2. the n number for the word, to relate to the n number on the corresponding <w> element in the original
  3. the offset to the corresponding word in the other array. Thus: for word 1 in textCollateable the offset would be 0; for word 2 and word 3 it would be -1. For word 1 in textOriginal the offset would be 0; for word 2 it would be +1.
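Putting numbers on the ‘theman cat’ example, a sketch in Python; the field names and the counterpart helper are invented for illustration:

```python
# The two parallel arrays for wit1 'theman cat', after the scholar splits
# 'theman' and regularizes 'cat' to 'Cat'. Each offset records the distance
# to the word's counterpart in the other array.
text_original = [
    {'n': 1, 'word': 'theman', 'offset': 0},   # counterpart: word 1
    {'n': 2, 'word': 'cat',    'offset': +1},  # counterpart: word 3
]
text_collateable = [
    {'n': 1, 'word': 'the', 'offset': 0},      # counterpart: word 1
    {'n': 2, 'word': 'man', 'offset': -1},     # counterpart: word 1
    {'n': 3, 'word': 'Cat', 'offset': -1},     # counterpart: word 2
]

def counterpart(i, source, target):
    """Follow the offset pointer from word i (1-based) of one array
    to the corresponding word in the other array."""
    entry = source[i - 1]
    return target[i - 1 + entry['offset']]['word']

print(counterpart(3, text_collateable, text_original))  # 'cat'
print(counterpart(1, text_original, text_collateable))  # 'the'
```

So the regularized ‘Cat’ in textCollateable still points back to the manuscript’s ‘cat’ in textOriginal, and collation results can always be referred to the original spelling and word division.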
June 21, 2007
Goodbye CollateXML, hello CollateX

Filed under: Introduction — Peter @ 1:32 pm

At last, we are moving. Today I began setting up the SourceForge site that will take all the code as we start work on the program. And in the process, I did what I have been planning to do for some time: change the name of the program from CollateXML to CollateX. Those who have read all the postings on this (of course, all of you) will know the reason for this change. We plan that the program should be able to collate texts in any format whatever, by devising a single canonical input form and then having translators into this canonical form. Thus it will be able to collate XML, sure: but it will also be able to collate many other formats, including indeed old-style Collate 1-3 files.

June 28, 2007
Collate examples, and first task

Filed under: Collate — Peter @ 10:02 pm
We have decided that the logical place to start is with the definition of data structures for the common input phase. So Andrew will get on today with working these out. Here are a few example sets for him to chew on:

Base the black cat
A The black cat
C The black cat
D The, black cat

Base the white and black cat
A The black cat
B the black and white cat
C the black and green cat

This Carpenter hadde wedded newe a wyf
This Carpenter hadde wedded a newe wyf
This Carpenter hadde newe wedded a wyf
This Carpenter hadde wedded newly a wyf
This Carpenter hadde E wedded newe a wyf
This Carpenter hadde newli wedded a wyf
This Carpenter hadde wedded a wyf

  1. He was agast so 33 witnesses
  2. He was agast 4 witnesses
  3. So he was agast 6 witnesses
  4. He was so agast 7 witnesses
  5. He was agast and feerd 2 witnesses
  6. So was he agast 1 witness
5. Time for some XML:
<l id="MI-35-El" n="35">&paraph; <w n="1">This</w> <w n="2">Carpente&rtail;</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l>
<l id="MI-35-Ii" n="35"><w n="1">This</w> <w n="2">Carpenter</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">a</w> <w n="6">newe</w> <w n="7">wi&ftail;</w> </l>
<l id="MI-35-Cn" n="35"><w n="1">This</w> <w n="2">Carpenter</w> <w n="3">had</w> <w n="4">newe</w> <w n="5">wedde&dtail;</w> <w n="6">awif</w> </l>
<l id="MI-35-Cp" n="35"><w n="1">This</w> <w n="2">Carpunter</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">a</w> <w n="6">newe</w> <w n="7">wy&ftail;</w> </l>
<l id="MI-35-Hg" n="35">&paraph; <w n="1">This</w> <w n="2">Carpenter</w> &virgule; <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l>
<l id="MI-35-Gg" n="35"><w n="1">This</w> <w n="2">carpenter</w> <w n="3">hadde</w> <w n="4">weddid</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l>

6. Some more XML, this time with more encoding:
<l id="MI-1-Bo1" n="1"><w n="1"><hi rend="unex" ht="3">w</hi><hi rend="ul">Hilom</hi></w> <w n="2">ther</w> <w n="3">was</w> <w n="4">duelling</w> <w n="5">in</w> <w n="6">Oxenford</w> <note
<l id="MI-1-Cx1" n="1"><w n="1"><hi ht="2" rend="other">W</hi>Hilom</w> <w n="2">therwas</w> <w n="3">dwellyn&gtail;</w> <w n="4">in</w> <w n="5">Oxenforde</w> </l>
<l id="MI-1-Bw" n="1"><w n="1"><hi ht="5" rend="orncp">W</hi>ylom</w> <w n="2">þer</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">in</w> <w n="6">Oxenford</w> </l>
<l id="MI-1-Ch" n="1"><w n="1"><hi ht="2" rend="orncp">W</hi>hilom</w> <w n="2">ther</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">at</w> <w n="6">Oxenforde</w> </l>
<l id="MI-1-Dd" n="1"><w n="1"><hi ht="5" rend="orncp">W</hi>hilom</w> <w n="2">there</w> <w n="3">was</w> <w n="4">dwellyng</w> &virgule; <w n="5">in</w> <w n="6">Oxenfor&dtail;</w> </l>
<l id="MI-80-Bo2" n="80"><w n="1">As</w> <w n="2">brode</w> <w n="3">as</w> <w n="4">is</w> <w n="5">þe</w> <w n="6">boos</w> <w n="7">of</w> <w n="8">a</w> <w n="9">bokelyr</w> </l>
<l id="MI-1-Ad3" n="1"><w n="1"><hi ht="4" rend="orncp">W</hi>hilom</w> <w n="2">ther</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">in</w> <w n="6">Oxenford</w> </l>
<l id="MI-1-Cp" n="1"><w n="1"><hi ht="6" rend="orncp">W</hi>hilom</w> <w n="2">þer</w> <w n="3">was</w> <w n="4">dwellyn&gtail;</w> <w n="5">at</w> <w n="6">Oxenfoor&dtail;</w> </l>

enough, now!

Email announcing these posts

Sent 6 February 2007 to Joris, Fotis, and other participants in the January 2007 meeting in The Hague:

Dear everyone
at the meeting a week last Friday, we had a good deal of discussion about the future of Collate.  Since the meeting, very much under the influence of Barbara, I have undergone a massive conversion.

Barbara pointed out that at the meeting, I seemed to be resisting the idea of handing over Collate for others to develop.  Indeed I was, and as Barbara pointed out: I was doing exactly what I forever denounce other people for doing: saying 'this is mine! hands off!!'.

So now, I have seen the light.  And indeed, the more I think of it: this is an ideal project for us all to collaborate on.  And I was really impressed with the enthusiasm in The Hague for doing this together.  So here is my suggestion: we develop the next Collate (which I suggest should be called CollateXML) together.  To start off the process, I have created a blog
Here you will find a whole series of materials now about Collate, thus:
An introduction
A history of the three earlier versions of Collate
A design outline for CollateXML
How CollateXML should work
Some Datastructures needed for Collate

There will be more to come, but this will get us running!  I will put up the code for Collate2, some of it, though I doubt this will be as useful as the explanations given on the blog site.

I'd be glad to hand this all over to you folks.  I'm sure happy to help out, and provide test and sample files, etc etc, and offer lots more advice where I felt it might help.  I'd say we could set this up as a source forge project and, well, get on with it.  One place to start would be with implementing the fundamental word by word collation algorithm set out in the 'How CollateXML should work' posting.

Well, I've thrown out the stone into the pond now.  So folks, who is ready to give up a few years of their life getting this to run???
all the best