tag:blogger.com,1999:blog-57740542195854815892024-03-26T23:37:25.420-07:00Scholarly Digital EditionsI am a twenty-year veteran of the making of scholarly editions in digital form, as editor, toolmaker, teacher, commentator, theorist and publisher.PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.comBlogger14125tag:blogger.com,1999:blog-5774054219585481589.post-5501491763515306632023-09-17T20:06:00.000-07:002023-09-17T20:06:06.312-07:00Setting up the revised Collation Editor: file structures<p> In an earlier post, I explained some of the history behind the Collation Editor, and our use of it in Textual Communities. At last, I am updating the Collation Editor embedded into TC!</p><p>The Collation Editor has two major dependencies:</p><p></p><ol style="text-align: left;"><li>On Python, for a series of critical tasks run through a Python server;</li><li>On CollateX, for the actual collation.</li></ol><p></p><p>The first task was to create a version of the Collation Editor Core implementing both dependencies. I did this by mirroring the structure of the stand-alone collation editor code (available at <a href="https://github.com/itsee-birmingham/standalone_collation_editor">https://github.com/itsee-birmingham/standalone_collation_editor</a>). 
Thus, this is what the top-level folder looks like in my implementation (in my installation, in /Applications/Collation_Editor_Core):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJE17QCegxkrdZYnmMXFBMHTTDbiy27esfWuXHxThCU75PpJtW5GbrRXYMDjFLXpyJ8-H9ipOBaA-t0kSg2e51bvCGQbyVbGzcdYMG6P17G6vAwi1NE6F0LT9W9D2qn47G8w0FeNj7nF422DeEPaZUXXpgPUvg71S6FLOx6P63TfgzGFF6S0QXcdDL/s570/files.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="416" data-original-width="570" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJE17QCegxkrdZYnmMXFBMHTTDbiy27esfWuXHxThCU75PpJtW5GbrRXYMDjFLXpyJ8-H9ipOBaA-t0kSg2e51bvCGQbyVbGzcdYMG6P17G6vAwi1NE6F0LT9W9D2qn47G8w0FeNj7nF422DeEPaZUXXpgPUvg71S6FLOx6P63TfgzGFF6S0QXcdDL/s320/files.png" width="320" /></a></div><p style="clear: both; text-align: left;">That is: at the root level I have a folder holding collateX, with the collatex-tools jar in it. There is a folder labelled "collation" which we will look at in a moment. 
There are two python files, and then a .sh and .bat file which start up the application (this structure is taken from the current stand-alone collation editor structure).</p><p style="clear: both; text-align: left;">Within the "collation" folder, here is what I have:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkCZgrMD8IpYsYFeHaHmoqcpO4UxG9dLyUtw0kL16j8vK7QOlOYqoceEPeSnRMvabgqCPx53lo7_eCzWveUtmv2Ins9pw8FfkolOsyqIKtBGrsJJeqA_Vhto4AmGjlESBM6O6aGBuIJpN0-XjDBg6_29P5qj1-YRy44LDw-_h1PTa614sfxYpX5Qi7/s624/files.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="286" data-original-width="624" height="147" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkCZgrMD8IpYsYFeHaHmoqcpO4UxG9dLyUtw0kL16j8vK7QOlOYqoceEPeSnRMvabgqCPx53lo7_eCzWveUtmv2Ins9pw8FfkolOsyqIKtBGrsJJeqA_Vhto4AmGjlESBM6O6aGBuIJpN0-XjDBg6_29P5qj1-YRy44LDw-_h1PTa614sfxYpX5Qi7/s320/files.png" width="320" /></a></div>And then, going still deeper, this is the content of the "core" folder:<div class="separator" style="clear: both; text-align: center;"><blockquote><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5Z3bUkh8F1KAXNEMuZQsmGYDOGPg3PnkyrDb7rM_WGkoZ9twsmXKqedRKIXAXp7Dm4jO9VNcD9aQV01VNK-f96tDpQFZKk0hpYBPU51ky_O-lZl_I5DEkbJQq-lGPxEy7syjojcGqDy2dAbh-3pOLziQQls9cni6Q7ZnpaFR8rpqn0OJTezFWtcCy/s602/files.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="526" data-original-width="602" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5Z3bUkh8F1KAXNEMuZQsmGYDOGPg3PnkyrDb7rM_WGkoZ9twsmXKqedRKIXAXp7Dm4jO9VNcD9aQV01VNK-f96tDpQFZKk0hpYBPU51ky_O-lZl_I5DEkbJQq-lGPxEy7syjojcGqDy2dAbh-3pOLziQQls9cni6Q7ZnpaFR8rpqn0OJTezFWtcCy/s320/files.png" width="320" /></a></blockquote></div><div>You see here a series of .py files, all needed for the link to Python to work. 
However, we need to have an index.html file in place to run the instance. The index.html file is actually contained within the "collation/static" folder, as follows:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjo_xxv2Zur_pEysw5GXIV8kfig2hYSk7n7YcdDCq0qJZWiQv715K5w8XfdccBvc5L8jhy0VdSijZcj55LI2p6snukBUMBQeDTJ59Pv4Erxjq-Q7fLBpAcU1w7epfwffyLN9rC7RBQgCbOPay01Nis-nLfOIyG_k8vstPdVvXSS-TWqhA5vKxlER_5F/s536/Untitled.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="436" data-original-width="536" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjo_xxv2Zur_pEysw5GXIV8kfig2hYSk7n7YcdDCq0qJZWiQv715K5w8XfdccBvc5L8jhy0VdSijZcj55LI2p6snukBUMBQeDTJ59Pv4Erxjq-Q7fLBpAcU1w7epfwffyLN9rC7RBQgCbOPay01Nis-nLfOIyG_k8vstPdVvXSS-TWqhA5vKxlER_5F/s320/Untitled.png" width="320" /></a></div>Here is what the index.html file has, in this starter configuration:<blockquote><div><div><span style="font-family: helvetica;"><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"></span></div><div><span style="font-family: helvetica;"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"></span></div><div><span style="font-family: helvetica;"><head></span></div><div><span style="font-family: helvetica;"> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></span></div><div><span style="font-family: helvetica;"> <meta http-equiv="X-UA-Compatible" content="IE=8" /></span></div><div><span style="font-family: helvetica;"> <title>Collation Editor</title></span></div><div><span style="font-family: helvetica;"> <meta name="description" content="Collation and Apparatus Editor" /></span></div><div><span style="font-family: helvetica;"> <meta http-equiv="X-UA-Compatible" content="IE=edge" /></span></div><div><span style="font-family: helvetica;"> 
<script></span></div><div><span style="font-family: helvetica;"> var SITE_DOMAIN = "http://localhost:8080";</span></div><div><span style="font-family: helvetica;"> var staticUrl = SITE_DOMAIN + '/collation/';</span></div><div><span style="font-family: helvetica;"> </script></span></div><div><span style="font-family: helvetica;"> <script type="text/javascript" src="/collation/js/jquery-3.3.1.min.js"></script></span></div><div><span style="font-family: helvetica;"> <script type="text/javascript" src="/collation/js/jquery-ui.min.js"></script></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><link rel=stylesheet href="/collation/pure-release-1.0.0/pure-min.css" </span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span>type="text/css"/></span></span></div><div><span style="font-family: helvetica;"> <script type="text/javascript" src="/collation/CE_core/js/collation_editor.js"></script></span></div><div><span style="font-family: helvetica;"> <script type="text/javascript"></span></div><div><span style="font-family: helvetica;"> var servicesFile = 'js/local_services.js';</span></div><div><span style="font-family: helvetica;"> collation_editor.init();</span></div><div><span style="font-family: helvetica;"> </script></span></div><div><span style="font-family: helvetica;"></head></span></div><div><span style="font-family: helvetica;"><body oncontextmenu="return false;"></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><div id="header" class="collation_header"></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><h1 id="stage_id">Collation</h1></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span 
style="white-space: pre;"> </span><h1 id="project_name"></h1></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><div id="login_status"></div></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span></div></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><div id="container"></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><p>Loading, Please wait.</p></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><br/></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><br/></span></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span></div></span></span></div><div><span style="font-family: helvetica;"> <span style="white-space: pre;"> </span><div id="footer"></div></span></div><div><span style="white-space: normal;"><span style="font-family: helvetica;"><span style="white-space: pre;"> </span><div id="tool_tip" class="tooltip"></div></span></span></div><div><span style="font-family: helvetica;"></body></span></div><div><span style="font-family: helvetica;"></html></span></div></div></blockquote><p>Note that the "src" and "href" attributes direct to "/collation..." not to "collation..". 
The leading "/" is important, as it sends the server to look for these files in the root "collation" folder.</p><div><div><div>With this structure we can start up an instance of the Collation Editor with both Python and CollateX in place by going to the terminal, moving into the root directory thus:</div><blockquote><div>cd /Applications/Collation_Editor_Core<br /></div></blockquote><p>And then starting up the instance with</p><blockquote><p>./startup.sh</p></blockquote><p>This calls Python 3 to start a server at localhost:8080, with the "collation" folder as the root, and running the Python .py files in the "collation/core" folder. It also starts up CollateX, from the "collatex" folder at the root, with CollateX running on another port. If all is in place, this is what you will see when you go to "http://localhost:8080/collation/" in your browser.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEho6k8jciTjiwdavHFp7bTVQnWow3q3aM3Ym833QMPeu61Vxbsy3k1nc75TM2iejfqoC5qKGsc--Rcam9RmWOwhub5ve2I5OF0uh3Qfc9-ERga1EZhLg7JWRROHDkQApwCcgGQjILHnXpjiowuH7kq8iyn3Ax8Qhl9SAtBWiCBBhUIMF4pcxBn21Kl5/s1550/files.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1264" data-original-width="1550" height="261" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEho6k8jciTjiwdavHFp7bTVQnWow3q3aM3Ym833QMPeu61Vxbsy3k1nc75TM2iejfqoC5qKGsc--Rcam9RmWOwhub5ve2I5OF0uh3Qfc9-ERga1EZhLg7JWRROHDkQApwCcgGQjILHnXpjiowuH7kq8iyn3Ax8Qhl9SAtBWiCBBhUIMF4pcxBn21Kl5/s320/files.png" width="320" /></a></div><p>If you have the "data" folder from the stand-alone installation in the "collation" folder, you can type "B04K6V23" into the "Select" box and then hit the "Collate Project Witnesses" button (currently not working ...)</p><blockquote><p> 
</p></blockquote></div></div>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-80773590438319147392023-09-17T00:12:00.002-07:002023-09-17T00:12:24.801-07:00Setting up the revised Collation editor: some history (2023)<p> I am a huge fan of the "Collation Editor", built by Cat Smith of the Institute for Textual Scholarship and Electronic Editing (ITSEE) at the University of Birmingham, with substantial input from Troy Griffitts, now at The Göttingen Academy of Sciences and Humanities in Lower Saxony. Some history is required. The roots of the Collation Editor lie in my <i>Collate</i> software, written for the Macintosh computer from 1989 on and, in its day, used heavily by multiple editing projects. Notable among these user projects were two groups editing Biblical texts: those associated with the Institute for New Testament research at Münster, Germany (<a href="http://egora.uni-muenster.de/intf/index_en.shtml">INTF</a>), and David Parker and scholars working with him at the University of Birmingham (now, <a href="https://www.birmingham.ac.uk/research/itsee/index.aspx">ITSEE</a>). </p><p>Part of the story of how Collate begat CollateX, and CollateX begat the Collation Editor, is told in other blogs on this site: <span color="rgba(0, 0, 0, 0.52)" face="Roboto, RobotoDraft, Helvetica, Arial, sans-serif" style="background-color: white; font-size: 14px;"><a href="https://scholarlydigitaleditions.blogspot.com/2014/09/the-history-of-collate.html">https://scholarlydigitaleditions.blogspot.com/2014/09/the-history-of-collate.html</a> and <a href="https://scholarlydigitaleditions.blogspot.com/2014/09/collate-2-and-design-for-its-successor.html">https://scholarlydigitaleditions.blogspot.com/2014/09/collate-2-and-design-for-its-successor.html</a>. </span>These blogs, though here dated 2014, were written in 2007. 
Other parts can be deduced from an article about the evolution of digital methods in the INTF and ITSEE written by myself, David Parker, Hugh Houghton and Klaus Wachtel (you can read that article at my <a href="https://www.academia.edu/79459437/The_Edition_Critica_Maior_Twenty_Years_of_Digital_Collaboration">Academia</a> site, or via its <a href="https://doi.org/10.1628/ec-2020-0009">DOI</a>). </p><p>The first part of this begetting is the making of <a href="https://collatex.net/">CollateX</a>. CollateX fulfilled completely the first part of the agenda I laid out in the blogs on this site: to create a system for comparison of multiple texts which was modular and independent of any one hardware or software implementation. CollateX is a marvel, and a remarkable achievement by the team of software engineers who made it (prominently, Ronald Dekker of the Huygens Institute, Amsterdam). </p><p>The second part of this begetting was the making of the Collation Editor. This creates an entire environment permitting editors to create exactly the collation they want, by determining through a point-and-click interface exactly what words collate with what and how the collation is to be expressed. Essentially, the Collation Editor is an interface to, and an extension of, CollateX: permitting editors to adjust the CollateX collations to create exactly the collations they want. For me, the test of the Collation Editor, and its implementation of CollateX, was simple: could we achieve exactly the same complex collations with the Collation Editor/CollateX as we could, from 1995 to around 2015, with Collate? The answer is, triumphantly, yes. Indeed, we could achieve far more with the Collation Editor than we ever could with <i>Collate</i>. Here is the tool I dreamed of in 2007. (Somewhere, I said that it would take a team of ten people ten years to make the replacement for <i>Collate</i>. 
I was not far wrong).</p><p>Accordingly, in 2016 I started work on integrating the Collation Editor into Textual Communities. We have now used this integrated implementation to collate some four thousand lines of the Canterbury Tales, in preparation for our forthcoming Critical Edition of the Tales. You can see how this works in a video I made, collating just one line of the Tales. As you can see, the Collation Editor can create exactly the highly complex collations we want. In recent years, it has become an absolutely vital part of our work on the Tales. However, the version we integrated in 2016, and which is still the version we are using, is now seriously outdated. Many improvements have been made to the Collation Editor since 2016 (or, in effect, 2019, when we last updated our implementation of the Collation Editor) and finally, thanks to a sabbatical, I am setting out to bring the Textual Communities version of the Collation Editor up to date. This task should be greatly eased by the re-organization and rewriting of the Collation Editor since 2019. The Collation Editor code has now been cleanly divided into a "core" code library, designed so that the whole core can fit inside any implementation and be easily updated, and a "services" code library, which connects the core to whatever implementation you want. 
In our case, we use MongoDB document databases to store all our information about our texts, and hence everything the Collation Editor needs to function should be linked to our MongoDB databases.</p><p>In the next posts, I will explain how I went about setting up the updated core collation tools of the Collation Editor to work within Textual Communities, in the same way as a series of blogs on staticSearch explains how I got this to work with our data.</p><p><br /></p><p><br /></p>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-75498446540167007192023-08-06T07:55:00.005-07:002023-08-06T08:15:29.236-07:00In praise of staticSearch<p>Over the last few weeks, I have worked intensively with staticSearch to integrate it into our forthcoming publication of Edvige Agostinelli and Bill Coleman's digital edition of Boccaccio's <i>Teseida</i>. You can see the near-complete prototype at <a href="http://inklesseditions.com/TeseidaStatic/">http://inklesseditions.com/TeseidaStatic/</a>. Note that this is still a prototype only, and the address will change as we move to full publication; please DO NOT repost this link on the open internet. Among other matters, this is "Endings"-conformant in its principal components: see <a href="https://scholarlydigitaleditions.blogspot.com/2023/07/the-endings-project-and-canterbury.html">https://scholarlydigitaleditions.blogspot.com/2023/07/the-endings-project-and-canterbury.html</a>. </p><p>As you can see (try searching for "come" by typing it in the search box and hitting return, or clicking the search icon), staticSearch works beautifully here. And hence my final word on staticSearch. This is a quite wonderful tool. It is lightning fast, easy to set up, and works like a dream. As a true Endings tool, it has no dependencies on any outside systems of any kind. Take a bow, Martin Holmes and Joey Takeda (and everyone else who has contributed). 
Great work.</p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5u-xXHsjiddxKJTHNAGWY3wD7YpRIMNEhymrT7t00NhAM73qL2eeq1F8B9upakds1IGpBZHStA-uM4HEFxXwxTrDgmHiK0kqDkwN890f3JaU8pJQ4qqMhrKjuxX7gQ5_0hjNojZRN-nIPPEeHftVd7NpFaf4hCsxxYzmNWNALc0u0YwyWyCp1w3Jr/s1868/teseida.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1430" data-original-width="1868" height="369" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5u-xXHsjiddxKJTHNAGWY3wD7YpRIMNEhymrT7t00NhAM73qL2eeq1F8B9upakds1IGpBZHStA-uM4HEFxXwxTrDgmHiK0kqDkwN890f3JaU8pJQ4qqMhrKjuxX7gQ5_0hjNojZRN-nIPPEeHftVd7NpFaf4hCsxxYzmNWNALc0u0YwyWyCp1w3Jr/w482-h369/teseida.png" width="482" /></a></div><br /><p><br /></p>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-20681951017104275012023-07-30T23:13:00.003-07:002023-08-06T07:39:17.063-07:00Setting up staticSearch for our projects: nested files and multiple search entry points<p> I now realize (a week later!), after looking at the staticSearch projects listed in the <a href="https://endings.uvic.ca/staticSearch/docs/projectsUsingSS.html">documentation</a> two things I did not know before, two things where our projects differ (it seems) from all other StaticSearch implementations to date:</p><p></p><ol><li>staticSearch assumes (or at least, all the listed projects appear to follow this model) that all the pages to be searched are held in the same folder as the root index.html folder. 
Indeed, a <a href="https://zenodo.org/record/3449197">2019 presentation </a>by the staticSearch team explicitly declares that "All pages live together in the same folder" and, furthermore, "We don't care" if that means there are 10,591 files in that one folder.</li><li>staticSearch assumes (or at least, all the listed projects appear to follow this model) that all searches are launched from a single place, and a single file, contained in that same folder holding all the project files.</li></ol>Neither of these assumptions holds good for our projects. I anticipate that the Canterbury Tales Project when complete (!) will require somewhere around 90,000 distinct html files: one for each of the 29,000 manuscript pages in which the <i>Tales</i> occur; three files for each of the some 20,000 entities (lines of poetry, blocks of prose) which constitute the text of the <i>Tales</i>. I, for one, am not comfortable with around 90,000 files in a single folder. We devised a uniform directory structure to hold all these files. The transcript of folio 1r in Hengwrt is held in "html/transcripts/Hg/1r.html"; the collation of the first line of the General Prologue is held in "html/collations/GP/1.html". By design, then, all our html files are buried four layers below the "home" folder holding our index.html file.<p></p><p>In fact, we discovered that the '<recurse>true</recurse>' statement in the configuration files means that staticSearch has no problem at all with nested directories. It duly finds and indexes all our html pages. But the second issue -- that the default staticSearch configuration expects that all searches will be run from a single file, located in the project home directory -- does cause problems. We could, quite easily, have set up our projects the same way as staticSearch expects, so that clicking on a "search" icon or similar on each of the 90,000 pages would send the reader to a single search page, presumably in the home directory. 
But we did not want to do that. Here is how the header for one of our project pages looks (this is folio 72r of the Naples manuscript of the forthcoming Agostinelli/Coleman edition of Boccaccio's <i>Teseida</i>):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgK9P35uj-C_reEKMc2ADoPxhby0Z4D6GueRRLJPy2rgzgLNGDAR-qiHJZjnN59OOnsJ5xEnDC_tsKJQmtoeA43v7fSmNP9uoawb9XV_WUay8pyLP0ZdlimLM30eDQVy8ThgcAA0LVxi85_4L6fYksFjT8j2V9WXQQ2laFPEuCzZVXLXrEPQXBrwrOR/s1666/preview.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="312" data-original-width="1666" height="95" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgK9P35uj-C_reEKMc2ADoPxhby0Z4D6GueRRLJPy2rgzgLNGDAR-qiHJZjnN59OOnsJ5xEnDC_tsKJQmtoeA43v7fSmNP9uoawb9XV_WUay8pyLP0ZdlimLM30eDQVy8ThgcAA0LVxi85_4L6fYksFjT8j2V9WXQQ2laFPEuCzZVXLXrEPQXBrwrOR/w507-h95/preview.png" width="507" /></a></div><p style="text-align: left;">A fundamental principle of our edition design is "have only the pages you really need". We want our readers to be able to run the search directly from the page they are looking at, and not have to go to any other page to do the search. Further, we want the header on all our pages to look the same, following another mantra: "keep everything as uniform as possible across the whole edition". This meant that every one of our (possibly) 90,000 html pages would have a search box on it, as you can see in the top right of this image. This meant too that searches would not always begin from a file located at the root of the project folder. Indeed, all searches except those run from the index.html starting point of our editions would begin from a file nested four layers deep in the project folder. 
And that is why we found the problems with folder paths referred to in the previous post.</p><div><p>I will post a suggestion in the staticSearch issues forum as to how staticSearch itself could help projects configured like ours, with many files spread over multiple folders and each file being a search access point. In a final post in this series, I offer some general thoughts about staticSearch.</p></div>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-3045824964265398222023-07-24T03:29:00.005-07:002023-08-05T21:31:18.127-07:00Setting up staticSearch for our project: the search results<p> In the <a href="https://scholarlydigitaleditions.blogspot.com/2023/07/setting-up-static-search-for-our.html">last post,</a> we integrated the search page into the header of our pages. Now, we need to deal with the search results. staticSearch by default places the search results in the <div id="ssResults"> element. But as part of our set up, we hid that element. Instead, we want the search results to appear in a different place on the page: in a <div id="searchContainer">. So how do we do this?</p><p>staticSearch has anticipated that users might want to intervene at the end of the search process to adjust how and where the search results appear. You can find this out by digging into the ssSearch-debug.js file which staticSearch helpfully makes available (it is in the staticSearch folder which Ant makes in your project folder). In it you will find references to a "searchFinishedHook" function, created explicitly as a hook where developers such as me can get at the results of the search and manipulate them before they are seen by the user. The definition of searchFinishedHook is left open in the staticSearch initialization: </p><p> this.searchFinishedHook = function(num){};</p><p> Accordingly we can redefine searchFinishedHook to let us do what we want to the search results. 
In this piece of code in the ssInitialize.js file in the staticSearch folder, we define our own searchFinishedHook function:</p><p> window.addEventListener('load', function() {Sch = new StaticSearch();<br /> Sch.searchFinishedHook = function (num) {<br /> $("#splash").hide();<br /> $("#rTable").hide();<br /> $("#searchContainer").html($("#ssResults").html());<br /> $("#searchContainer").show();<br /> }<br /> });</p><p>The first two lines hide the "splash" and "rTable" elements on the page. The next line copies all the search results from the hidden "ssResults" element into the "searchContainer" element, and the last line shows that element. Here is what it looks like for a simple search:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMniTCYHfsVcLsAjQ9A1ROQGw4Qy6hK8i0Gl059sob5HXhSnoQiSrzoEoJfzQuyM8lK3LpjIAogjDf-D9pxy0hTFlGFAKsXMPaT8NxZoD_Antcrc2YJ6G2IHFeqB6uKI1Yasag2RmAx0I31zlmVw0gd2xv_ms-wp7PByR2yN_QaOzAwmV-cbqgVknd/s1908/headingxx.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1144" data-original-width="1908" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMniTCYHfsVcLsAjQ9A1ROQGw4Qy6hK8i0Gl059sob5HXhSnoQiSrzoEoJfzQuyM8lK3LpjIAogjDf-D9pxy0hTFlGFAKsXMPaT8NxZoD_Antcrc2YJ6G2IHFeqB6uKI1Yasag2RmAx0I31zlmVw0gd2xv_ms-wp7PByR2yN_QaOzAwmV-cbqgVknd/w507-h304/headingxx.jpg" width="507" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p style="clear: both; text-align: left;">For a first, straight out-of-the-box effort, this is really impressive! And also lightning fast. You can see too that staticSearch has set us up with links to each page. In our project, we have the base index.html at the root of our project folder. 
All the files with transcripts of each page are in a folder labelled "html", with a subfolder "transcripts", a subfolder for each manuscript, and finally the html for each page. Thus the transcript for folio 1r of manuscript AUT is in html/transcripts/AUT/1r.html.</p><p style="clear: both; text-align: left;">staticSearch knows that each page it indexes is in the folder "html/transcripts/..." relative to the index.html file. Accordingly, staticSearch creates a link to each page as (for example) href="html/transcripts/AUT/1r.html". The links to files work fine for searches from our index.html file. Following standard web protocols, the path "html/transcripts/AUT/1r.html" is appended to the path to the index.html file (e.g. "https://www.inklesseditions.com/teseida/"), thus becoming "https://www.inklesseditions.com/teseida/html/transcripts/AUT/1r.html".</p><p style="clear: both; text-align: left;">However, we want to do searches not just from the root index.html file, but also from every transcript file. That is: the calling file for each search is NOT at the root directory, where the index.html file is, but in a directory nested several layers deep within the root. For example, a search from the transcript file for page 1r of the AUT manuscript will be sent from html/transcripts/AUT/1r.html. This affects all the file calls needed and created for staticSearch. 
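</p><p style="clear: both; text-align: left;">This resolution behaviour can be checked directly with the WHATWG URL API, which is what browsers use to resolve these links (a quick illustration only, reusing the example addresses above):</p>

```javascript
// How a browser resolves the relative links staticSearch writes into
// its results, depending on which page the search was run from.

// From the root index.html, the relative link resolves correctly:
const fromIndex = new URL(
  "html/transcripts/AUT/1r.html",
  "https://www.inklesseditions.com/teseida/index.html"
);
console.log(fromIndex.href);
// -> https://www.inklesseditions.com/teseida/html/transcripts/AUT/1r.html

// From a transcript page buried several folders down, the same relative
// link is appended to that page's own folder instead:
const fromTranscript = new URL(
  "html/transcripts/NO/1r.html",
  "https://www.inklesseditions.com/teseida/html/transcripts/AUT/1r.html"
);
console.log(fromTranscript.pathname);
// -> /teseida/html/transcripts/AUT/html/transcripts/NO/1r.html
```

<p style="clear: both; text-align: left;">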
Accordingly, we have to make multiple adjustments within the file 1r.html to prepare it for staticSearch:</p><p style="clear: both; text-align: left;"> The calls to ssInitialize.js and ssSearch.js have to go to ../../../staticSearch/ssInitialize.js and <br /> ../../../staticSearch/ssSearch.js (the same for ssSearch-debug.js if you are using that instead)</p><p style="clear: both; text-align: left;"> The attribute value @data-ssfolder on the form id="ssForm" which runs the search has to be set <br /> to ../../../staticSearch and not just staticSearch</p><p style="clear: both; text-align: left;">With these adjustments, the search from 1r.html runs just fine. But we have a problem with the links to the pages in the search results. It is this: because staticSearch does not understand that the 1r.html file is buried in html/transcripts/AUT/, it fails to create valid links in the search results to the files containing the search hits. For example: the link from html/transcripts/AUT/1r.html to 1r.html of the NO manuscript should be html/transcripts/NO/1r.html. Instead, staticSearch links to <br /> html/transcripts/AUT/html/transcripts/NO/1r.html</p><p>That is: it concatenates the file path for html/transcripts/NO/1r.html with the path for html/transcripts/AUT/. The path should actually be '../../../html/transcripts/NO/1r.html'. </p><p>It is not too difficult to fix these incorrect paths by calling a function to rewrite these internal links in our override of the searchFinishedHook function. But this is somewhat ugly. A better solution would be to have staticSearch recognize where it is running from in the file system and adjust links accordingly in the Ant process. 
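</p><p>For now, such a rewrite might look like this. This is a sketch only: the helper names are mine, not staticSearch's; the "searchContainer" id and the depth of three come from our own pages as described above:</p>

```javascript
// Pure string rewrite: prefix a root-relative result link with enough
// "../" segments to climb from the current page back to the project root.
function rootRelativeToLocal(href, depth) {
  // Leave absolute URLs, root-absolute paths, and already-adjusted links alone.
  if (/^(?:[a-z][a-z0-9+.-]*:|\/|\.\.\/)/i.test(href)) {
    return href;
  }
  return "../".repeat(depth) + href;
}

// DOM applier, for use in the browser once the results have been copied
// into our "searchContainer" element:
function fixResultLinks(containerId, depth) {
  document.querySelectorAll("#" + containerId + " a[href]").forEach(function (a) {
    a.setAttribute("href", rootRelativeToLocal(a.getAttribute("href"), depth));
  });
}

// e.g., as the final step of our searchFinishedHook override:
//   fixResultLinks("searchContainer", 3);
```

<p>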
In the <a href="https://scholarlydigitaleditions.blogspot.com/2023/07/setting-up-staticsearch-for-our.html">next post</a>, I explore how these problems have arisen and what might be done about it.</p><p><br /></p>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-63553889315205593092023-07-23T07:32:00.002-07:002023-08-05T21:34:10.601-07:00Setting up staticSearch for our project: integrating the search box into our pages<p> In earlier posts, I describe the background to the decision to use staticSearch and my experience of getting it to work. In this post, I describe how we are winding staticSearch into our editions.</p><p>By default, staticSearch places everything it uses into a <div id="staticSearch"> element. So your core search page, typically the "index.html" file at the root of your document collection, has to contain a <div id="staticSearch"> </div> element. When you run the Ant process, as described in <a href="https://scholarlydigitaleditions.blogspot.com/2023/07/staticsearch-and-me.html">https://scholarlydigitaleditions.blogspot.com/2023/07/staticsearch-and-me.html</a>, the <div id="staticSearch"> gets populated with multiple javascript and html statements. 
Here is the beginning of what staticSearch pastes in, as of version 1.4.4:</p><p></p><blockquote><div style="text-align: left;"><div id="staticSearch"> </div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote><div style="text-align: left;"><script xmlns="http://www.w3.org/1999/xhtml" src="staticSearch/ssSearch-debug.js"></script> </div></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote><div style="text-align: left;"><script xmlns="http://www.w3.org/1999/xhtml" src="staticSearch/ssInitialize.js"></script></div></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote><p style="text-align: left;"><span style="white-space: normal;"><noscript xmlns="http://www.w3.org/1999/xhtml">This page requires JavaScript.</noscript></span></p></blockquote></blockquote><blockquote><p><span style="white-space: normal;"><span style="white-space: pre;"> </span><form xmlns="http://www.w3.org/1999/xhtml" accept-charset="UTF-8" id="ssForm" </span></p></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote><p style="text-align: left;"><span style="white-space: normal;">data-allowphrasal="yes" data-allowwildcards="yes" data-minwordlength="2" </span></p></blockquote></blockquote></blockquote><blockquote><p><span style="white-space: normal;"><span style="white-space: pre;"> </span>data-scrolltotextfragment="no" data-maxkwicstoshow="5" data-resultsperpage="5" </span></p><p><span style="white-space: normal;"><span style="white-space: pre;"> </span>onsubmit="return false;" data-versionstring="" data-ssfolder="../../../staticSearch"</span></p><p><span style="white-space: normal;"><span style="white-space: pre;"> </span> data-kwictruncatestring="..." 
data-resultslimit="2000"></span></p></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote><p style="text-align: left;"><span style="white-space: normal;"><span style="white-space: pre;"> </span><span class="ssQueryAndButton"></span></p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote><p style="text-align: left;"><span style="white-space: normal;"><span style="white-space: pre;"> </span><input type="text" id="ssQuery" aria-label="Search"/></span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote><p style="text-align: left;"><span style="white-space: normal;"><span style="white-space: pre;"> </span><button id="ssDoSearch">Search</button></span></p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote><p style="text-align: left;"><span style="white-space: normal;"><span style="white-space: pre;"> </span></span></span></p></blockquote></blockquote><p style="text-align: left;"><span style="white-space: normal;"><span style="white-space: pre;"> </span></form></span></p><p style="text-align: left;"><span style="white-space: normal;">This fragment sets up the search form, which will appear wherever you have put </span><div id="staticSearch"> in your document. In our implementation, we place it in the document header, where we want this to take up very little space. 
In the default implementation, the <form id="ssForm"> is followed by two other elements, thus:</p><p><span style="white-space: normal;"><span style="white-space: pre;"> </span> <div xmlns="http://www.w3.org/1999/xhtml" id="ssSearching" >Searching...</div></span></p><p><span style="white-space: normal;"><span style="white-space: pre;"> </span><div xmlns="http://www.w3.org/1999/xhtml" id="ssResults"></div></span></p><p><span style="white-space: normal;"><span style="white-space: pre;"> </span><div xmlns="http://www.w3.org/1999/xhtml" id="ssPoweredBy"> //ssLogo etc</span></p><p><span style="white-space: normal;">If we keep this as is, here is what the top of our page looks like:</span></p><p><span style="white-space: normal;"></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-BogEofGXuaUQZqqFcER8uiSAhG2aeKmbAsdJA4iKfbpAu6PDpQI3u3tWqBkpuCAgRn7c7a5vq-Zgb2-9N_ZPmbgH3LfR7joiYyVlIbFzEwISX_2RIlIZ9FMjw3NVJvPkX3YT0NFK-c0xtlkz8TM8Q6KiknC5bzprkKe5QqV6VlQRfijTguZSH4eH/s2028/headingxx.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="258" data-original-width="2028" height="75" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-BogEofGXuaUQZqqFcER8uiSAhG2aeKmbAsdJA4iKfbpAu6PDpQI3u3tWqBkpuCAgRn7c7a5vq-Zgb2-9N_ZPmbgH3LfR7joiYyVlIbFzEwISX_2RIlIZ9FMjw3NVJvPkX3YT0NFK-c0xtlkz8TM8Q6KiknC5bzprkKe5QqV6VlQRfijTguZSH4eH/w586-h75/headingxx.jpg" width="586" /></a></div><br /><p style="clear: both; text-align: left;">This is rather ugly, as the "ssSearching" and "ssPoweredBy" elements intrude on our very clear header. So we can suppress those by adding 'style="display:none"' to those elements. We will also add 'style="display:none"' to the "ssResults" element: more on that in a moment. 
Thus:</p><p><span style="white-space: pre;"> </span> <div xmlns="http://www.w3.org/1999/xhtml" id="ssSearching" <br /> style="display:none" >Searching...</div></p><p>Now, this is how it looks once those elements have been hidden:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9tP1_0EU0yqAkBp5AaMF4mo3ivks52Y_ofSuIrU8KOM6MgoXlbiPYOqFNbhEmFIaK5rN_Rr31-s8-7ZhD3Jku35pUQ99_lKK5AbP4liWZevE_lGklSqNR2-y9Atolq25gIo8E6J9255MECJszFSwqeIla4dX7cXEICjtxm3f0G1EuwKVuSQfWjvRG/s2126/headingxx.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="114" data-original-width="2126" height="32" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9tP1_0EU0yqAkBp5AaMF4mo3ivks52Y_ofSuIrU8KOM6MgoXlbiPYOqFNbhEmFIaK5rN_Rr31-s8-7ZhD3Jku35pUQ99_lKK5AbP4liWZevE_lGklSqNR2-y9Atolq25gIo8E6J9255MECJszFSwqeIla4dX7cXEICjtxm3f0G1EuwKVuSQfWjvRG/w614-h32/headingxx.jpg" width="614" /></a></div><br /><p>This could be better yet. Space in the header is at a premium: every letter counts. Instead of "Search", taking up rather a large box, we might have just a "?" or, even better, a search icon. So we adjust the appearance of the "Search" button with this bit of css:</p><p> #ssDoSearch {<br /> background-image: url("../../common/images/searchicon.png");<br /> background-size: 15px;<br /> height: 24px;<br /> width: 23px;<br /> background-position: 50%;<br /> background-repeat: no-repeat;<br /> position: relative;<br /> top: 5px;<br /> }<br /> #ssQuery {<br /> width: 100px;<br /> }</p><p>The #ssQuery rule also makes the search box a bit narrower. 
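An aside on the display:none edits: since the Ant build writes this markup into the page, hand-added inline styles would presumably need re-adding whenever the build is re-run, whereas a single rule in the project stylesheet covering all three generated ids survives any rebuild. The tiny helper below (hideRule is my name, not staticSearch's) just spells out what that rule looks like; the ids are the ones staticSearch itself generates:

```javascript
// Build one CSS rule hiding every listed id. Purely demonstrative:
// in practice you would simply write the resulting rule into the
// same stylesheet as the #ssDoSearch icon styling.
function hideRule(ids) {
  return ids.map((id) => '#' + id).join(', ') + ' { display: none; }';
}

const rule = hideRule(['ssSearching', 'ssResults', 'ssPoweredBy']);
// rule: "#ssSearching, #ssResults, #ssPoweredBy { display: none; }"
```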
So now it looks like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiR_bUp_rhBLNauq69KiUCt8NkShXkU5uFozpwKg69bLH4EdsQaO_XLOtH4znMQHgRrPcFr_ieh8Y59_YCUiRg6UwGAHWS3zfcC0YyPQ6EI1JQXGDFisA3RdR0ZWh0aS6uJYOrSN8uA9iMfRBYcZB4FncXhGYtjl1DDTVTUW2PIUNvb44WlVP1YIyp/s2252/headingxx.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="104" data-original-width="2252" height="25" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiR_bUp_rhBLNauq69KiUCt8NkShXkU5uFozpwKg69bLH4EdsQaO_XLOtH4znMQHgRrPcFr_ieh8Y59_YCUiRg6UwGAHWS3zfcC0YyPQ6EI1JQXGDFisA3RdR0ZWh0aS6uJYOrSN8uA9iMfRBYcZB4FncXhGYtjl1DDTVTUW2PIUNvb44WlVP1YIyp/w550-h25/headingxx.jpg" width="550" /></a></div><p style="text-align: left;">This is beginning to look fine. Now, let's see what the search results look like in the next post.</p><p><br /></p><div><br /></div><br /><br /><p></p><p><span style="white-space: normal;"></span></p>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-79975840157138705112023-07-20T21:46:00.007-07:002023-08-05T21:36:18.346-07:00staticSearch and Me: getting started<p><a href="https://scholarlydigitaleditions.blogspot.com/2023/07/the-endings-project-and-canterbury.html"> In another post</a>, I explain the background to my work in making digital scholarly editions in relation to the Endings project, and how this led me to staticSearch. In this post, I describe my use of staticSearch, in the hope that it might help others who, like me, want to include a search engine in their online resource. I am using a Macintosh MacBook Pro, running Ventura 13.4 in July 2023 as I write this.</p><p>The documentation for staticSearch is at <a href="https://endings.uvic.ca/staticSearch/docs/index.html">https://endings.uvic.ca/staticSearch/docs/index.html</a>. 
The first step is to download the source code at https://github.com/projectEndings/staticSearch/releases/. This should arrive on your computer as a zip file named staticSearch-1.4.4.zip, or similar. Just double-click to unpack the zip file into a folder named staticSearch-1.4.4. Move that folder out of the downloads folder to somewhere convenient: easiest and simplest is to put it in your Applications folder.</p><p>Before you can do anything more: you need Apache Ant. This is a tool designed to build complex software projects from source. Ant will read a set of instructions: get this file! rebuild it this way! save it to this file! now get another software process to use that file to convert other files into something else! and then create new objects (new files, new tools, new libraries) from those files! etc. etc. Section 7.7 of the staticSearch documentation says laconically:</p><p><span face="georgia, sans-serif" style="background-color: #f9f9f9;"><span> </span>Note: you will need Java and </span><a class="link_ref" href="https://ant.apache.org/" style="background-color: #f9f9f9; color: #882200; font-family: georgia, sans-serif;">Apache Ant</a><span face="georgia, sans-serif" style="background-color: #f9f9f9;"> installed, as well as </span><a class="link_ref" href="http://ant-contrib.sourceforge.net/" style="background-color: #f9f9f9; color: #882200; font-family: georgia, sans-serif;">ant-contrib</a><span face="georgia, sans-serif" style="background-color: #f9f9f9;">.</span></p><p style="text-align: left;">You should have Java already, in an up-to-date distribution, as part of your computer. But you may need to get Apache Ant. You get it from https://ant.apache.org/srcdownload.cgi. Look for the latest version: in July 2023, this was 1.10.13. This requires Java 8, which you should already have. Download the zip file, double-click to unpack it to a folder named apache-ant-1.10.13 (or similar). 
As before, move that folder into your Applications folder.</p><p style="text-align: left;">You also have to get ant-contrib. This is a little more complex. What you actually need are two Java .jar files, named "cpptasks-1.0b5.jar" and "ant-contrib-1.0.jar". It took me a while to figure this out. The ant-contrib <a href="https://ant-contrib.sourceforge.net/cpptasks/index.html">page</a> gives you the source for cpptasks, and someone with more expertise and time than me could (I suppose) compile the source into a Java .jar. But I took the short-cut and found a copy of cpptasks-1.0b5.jar out there on the net (in my case, at https://jar-download.com/artifacts/ant-contrib/cpptasks/1.0b5#google_vignette). I found ant-contrib-1.0.jar at http://www.java2s.com/Code/Jar/a/Downloadantcontrib10jar.htm.</p><p style="text-align: left;">Once you have these .jar files: place them both in the lib directory of your Apache Ant folder. </p><p style="text-align: left;">You are now ready to test out staticSearch. Here's what you do:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>Open the Macintosh terminal application. You will find this in your Applications/Utilities folder. This is a good old-fashioned command-prompt system, like we all used back in the 80s (remember the 80s? Wham? Freddie Mercury? yes, those). </li><li>In the terminal: move into your staticSearch folder. If you have unpacked it into Applications as "staticSearch-1.4.4" you should type "cd /Applications/staticSearch-1.4.4" into the terminal.</li><li>Now you are ready to test that all is working. For this you have to run Ant. You do this with the following command at the terminal "/Applications/apache-ant-1.10.13/bin/./ant" (assuming you have got Ant in a directory named apache-ant-1.10.13 inside Applications). If all is installed correctly you should see a lot of output on the screen until, finally, a triumphant "BUILD SUCCESSFUL" message comes up. 
(If you are smarter than I am you might be able to edit the $PATH statement in your terminal profile so that you just need to type "ant" into the terminal, and not "/Applications/apache-ant-1.10.13/bin/./ant". It seems Apple do not want you to edit your terminal profile, and are making this rather difficult: see <a href="https://stackoverflow.com/questions/9832770/where-is-the-default-terminal-path-located-on-mac">https://stackoverflow.com/questions/9832770/where-is-the-default-terminal-path-located-on-mac</a>.)</li></ol>Now, try it with your own HTML. The staticSearch documentation is excellent. I created a folder called "mystuff" inside the staticSearch folder. In this folder I put all my html, itself in another folder called "html". I had an index.html file in the root of the mystuff folder and I had an xml file called "ssconfig.xml" containing the key instructions directing staticSearch to work on my html:<p></p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"><config xmlns="http://hcmc.uvic.ca/ns/staticSearch"></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"> <params></div></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"><searchFile>index.html</searchFile></div></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: 
left;"><recurse>true</recurse></div></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"> </params></div></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"></config></div></blockquote><p style="text-align: left;">I now ran staticSearch on my material with this command: </p><p style="text-align: left;"></p><blockquote><span style="font-family: Menlo; font-size: 11px; font-variant-ligatures: no-common-ligatures;">/Applications/apache-ant-1.10.13/bin/./ant -DssConfigFile=/Applications/staticSearch-1.4.4/mystuff/ssconfig.xml</span></blockquote><div style="text-align: left;">(I could also have used just "<span style="font-family: Menlo; font-size: 11px; font-variant-ligatures: no-common-ligatures;">/Applications/apache-ant-1.10.13/bin/./ant -DssConfigFile=mystuff/ssconfig.xml" </span><span style="font-variant-ligatures: no-common-ligatures;"><span style="font-family: inherit;">as I am already in the staticSearch folder)</span></span></div><div style="text-align: left;"><span style="font-variant-ligatures: no-common-ligatures;"><span style="font-family: inherit;"><br /></span></span></div><div style="text-align: left;"><span style="font-variant-ligatures: no-common-ligatures;"><span style="font-family: inherit;">The first time I tried this, it did not work. It turns out that the <params> declaration needs a whole lot more in it, or you get a failed build. 
<params> needs to contain declarations as follows:</span></span></div><div style="text-align: left;"><span style="font-variant-ligatures: no-common-ligatures;"><span style="font-family: inherit;"><div> <phrasalSearch>true</phrasalSearch></div><div> <wildcardSearch>true</wildcardSearch></div><div> <createContexts>true</createContexts></div><div> <resultsPerPage>5</resultsPerPage></div><div> <minWordLength>2</minWordLength></div><div> <maxKwicsToHarvest>5</maxKwicsToHarvest></div><div> <maxKwicsToShow>5</maxKwicsToShow></div><div> <totalKwicLength>15</totalKwicLength></div><div> <kwicTruncateString>...</kwicTruncateString></div><div> <verbose>false</verbose></div><div> <stopwordsFile>test_stopwords.txt</stopwordsFile></div><div> <dictionaryFile>english_words.txt</dictionaryFile></div><div> <indentJSON>true</indentJSON></div><div>It turns out that this issue is a part of a wider discussion in the SS community on what needs to be declared in the set-up, and what can be set as defaults. See the discussion in the comments on <a href="https://github.com/projectEndings/staticSearch/issues/270">https://github.com/projectEndings/staticSearch/issues/270</a>, where I first reported my experience, and on <a href="https://github.com/projectEndings/staticSearch/issues/195">https://github.com/projectEndings/staticSearch/issues/195</a>, where the wider discussion takes place.</div><div><br /></div><div>Now that I had staticSearch running: the next step was to start integrating it into our own HTML. 
That's the subject of the next post.</div><div><br /></div><div><br /></div><div><br /></div></span></span></div><div style="text-align: left;"><span style="font-variant-ligatures: no-common-ligatures;"><span style="font-family: inherit;"><br /></span></span></div><p></p>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-77099230416293445072023-07-20T04:26:00.005-07:002023-08-05T21:21:53.932-07:00The Endings Project and the Canterbury Tales Project (and also, Boccaccio and Dante)<p> At last, after many years, we (ie, me and a few other people) are getting ready to unleash on the world a whole series of digital scholarly editions. We have already released the second edition of Prue Shaw's <i>Commedia</i>, now at <a href="http://www.dantecommedia.it">www.dantecommedia.it</a>. We are now contemplating a third edition of that. Soon to come are Bill Coleman and Edvige Agostinelli's edition of Boccaccio's <i>Teseida</i>. And then the really big one: the first tranches of the <i>Critical Edition of the Canterbury Tales. Based on All Known Pre-1500 Witnesses</i>, with myself and Barbara Bordalejo as General Editors. All of these will appear in the next twelve months.</p><p>Why so long? We (as before) have been working on all these since the 1990s (the Dante and Chaucer) and 2000s (Boccaccio). There are multiple reasons. For this post, one reason is specially important: we wanted to be sure the edition could survive the chances of online time. It should stand alone, for decades and even centuries to come, as surely as a print edition might survive upon a library shelf. How could we achieve this, given all the shifting currents of the digital world?</p><p>We were not the only people worrying about this. From 2016 a five-year SSHRC grant (Canada) funded the <a href="https://endings.uvic.ca/">Endings project</a>. 
This project took as its starting point a number of digital projects based at the University of Victoria which faced exactly the same issue we had: how can these projects be given the best chance of survival long into the future? In fact, I did not come across the Endings project until a long way into the making of Shaw's second Commedia edition. By this time I had already reached conclusions identical (or nearly so) to those of the Endings project, as follows:</p><p>1. While our development of these editions had used custom database technologies to present and edit all project data, our published editions would not use databases or any related "server-side" technology at all: no databases, no PHP, no python, nothing. That is: everything would be contained on one server with no outside dependencies at all so far as our texts are concerned.</p><p>2. Our presentation of the texts would rely solely on the core web technologies of HTML5, css and javascript. Nothing else.</p><p>3. Any departures from these principles for any part of our edition (for example: the use of external JavaScript libraries; the use of IIIF image viewers) would use widely-used open source tools.</p><p>These principles correspond to the <a href="https://endings.uvic.ca/principles.html">Endings project principles</a> 4.1, 4.2 and 4.9. In some areas, however, our practice differs from that of the Endings project. For example, we do use the jQuery library, which in my view has now achieved core web technology status. I think the same is becoming true of the IIIF family. However, I do not think the same is true of XML technologies (nor, interestingly, do the Endings people) and we do not use XSLT, etc, as any part of our final publication model. We also use query strings, which again seem to me a core web technology, where Endings does not. Nor do we aim for "graceful failure" where css/javascript/something else does not work. 
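To illustrate the query-string point: parsing one needs nothing beyond the browser's built-in URLSearchParams, so no server-side code is involved. The parameter names below are invented for the illustration, not our edition's actual ones; in a real page you would pass window.location.search rather than a literal string:

```javascript
// Core-web-technology routing: pick content from the query string
// with no library and no server. Parameter names are hypothetical.
const params = new URLSearchParams('?witness=NO&page=1r');
const witness = params.get('witness'); // "NO"
const page = params.get('page');       // "1r"
const missing = params.get('nosuch');  // null when a parameter is absent
```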
It seems to me that providing all source data within the edition, permitting others to fashion new interfaces to our data, is the best way of anticipating any failure.</p><p>One might object: we are making a bet on certain core technologies now still being core technologies centuries in the future. Yes we are. But we see this bet as being in the same category as the bet scholars have made for millennia: that there will be a library or other place somewhere in the future which has a shelf for my book.</p><p>Another principle of the Endings project is that it will not use an external service to provide functionality, and specifically names Google Search as such a service. In my early preparations for the Shaw edition, I had investigated using Google Search to provide a search tool. Indeed, the second edition at <a href="http://www.dantecommedia.it">www.dantecommedia.it</a> implements searching in exactly this way. You can see from just a cursory use of Google Search in the second edition how unsatisfactory it is. Searching for "come", one of the most common words in the <i>Commedia</i>, gives just one result; "tanto" yields none at all. Many search results begin with advertisements, for holidays, or beer. I spent many hours trying to get Google Search to do better, including feeding it hard-wired urls to every page of transcription. Nothing seemed to work. It appears the Google algorithms rebel when faced with nine near-identical texts, and fail over and over to return anything like meaningful results. </p><p>For these reasons I was contemplating just how a stand-alone search system might be implemented, when I came across the Endings project, and StaticSearch. They had done it! and it worked! 
<a href="https://scholarlydigitaleditions.blogspot.com/2023/07/staticsearch-and-me.html">On another page</a>, I describe my experiences of StaticSearch.</p><p><br /></p>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-2709222988844030672021-07-20T07:23:00.003-07:002021-07-20T08:10:31.427-07:00Fun with Fonts. Junicode, Unicode, and ꝑ<p> If you see a character looking like a p with a bar through the descender in the title of this post, and you see it here too <span style="font-size: large;">ꝑ</span>, then ... read on. And if you don't, then read on (and let me know!)</p><p>Thirty years ago, when I, Tim Berners-Lee, Lou Burnard and the web were much younger, every "special character" was a challenge, and a potential triumph or failure. "Special" meant something beyond ASCII 127 (ah, the acronyms!). It meant anything non-English, in the most limited BREXIT sense. E-acute was used by people from across the Channel, and a few Canadians, and not to be used without Special Equipment (in those days, a Macintosh computer). Devanagari was a distant dream, and right-to-left writing, an impossibility.</p><p>Nowadays, thanks to Unicode, and the work of many unsung heroes of font-design, with a special shout-out to those who sat on myriad committees and shepherded the whole process to every smart phone on the planet, we have become so used to everything appearing just right, with no effort at all on our part, that we are in danger of forgetting how many miracles had to occur so that I can insert a <span style="font-size: large;">ꝑ </span>in my document, and you can see it. (The best miracles are made by people working together, of course). But every now and then, something happens to remind us of how many ducks make a row.</p><p>Like many medievalists, I am a fan of Peter Baker's beautiful Junicode font. 
For years, I have been happily typing <span style="font-size: x-large;">ꝑ </span>into transcriptions, Word and pdf documents. This and a few other characters are very common in many medieval vernacular and Latin manuscripts. <span style="font-size: x-large;">ꝑ</span> is used as an abbreviation for per or par, as in "person" and "parish", and so found everywhere in Chaucer manuscripts (think of the Parson and the Pardoner). One of the great joys of Junicode is that it shows this character in a particularly elegant form, appearing as </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ5uCv4Sg75eTqIGQrfN1L3kk-30LexSmedqVtghs2RdeLnHXokGO0NhHIrfG33H6Z8-7J7ilI07Vvn2_jzwcGjnGsiNC8FzlqMlFDR-NXPGjwU2hbEadUM9hbpHbOe87W38-BUKZA3g/s78/pbar.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="78" data-original-width="66" height="48" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ5uCv4Sg75eTqIGQrfN1L3kk-30LexSmedqVtghs2RdeLnHXokGO0NhHIrfG33H6Z8-7J7ilI07Vvn2_jzwcGjnGsiNC8FzlqMlFDR-NXPGjwU2hbEadUM9hbpHbOe87W38-BUKZA3g/w41-h48/pbar.png" width="41" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Over the years, we have used Junicode in all our work with medieval texts, and have become so accustomed to the daily miracle of Junicode that we don't think about it. It works. "We" is all the people who work on the <a href="https://www.canterburytalesproject.org/">Canterbury Tales Project</a> and a few other projects -- particularly Dante. I am currently working with various Dante scholars on a new publication, coming soon to a browser near you. Trust me, you will know about this when it happens. 
So, imagine my surprise when after so many years of trouble-free use, my main collaborator said that our elegant Junicode p with bar appeared as a horrid oversize black character on her computer, thus:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvxb34wTVWtR6N-ajptNINoXzfsbLUnMFQe0MttT870qRyt9SMWiOpQ53HtJSvTQKTUZnxP-qFuFsf7U5M5_mtRO5AUAcwANEwXsUW4Nl1bgTSADudhyphenhyphen9Mv-SOOgLTpK5zLRgvOWCwtg/s112/pbar.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="112" data-original-width="92" height="57" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvxb34wTVWtR6N-ajptNINoXzfsbLUnMFQe0MttT870qRyt9SMWiOpQ53HtJSvTQKTUZnxP-qFuFsf7U5M5_mtRO5AUAcwANEwXsUW4Nl1bgTSADudhyphenhyphen9Mv-SOOgLTpK5zLRgvOWCwtg/w47-h57/pbar.png" width="47" /></a></div><br /><div class="separator" style="clear: both; text-align: left;">At first, I thought this was just an aberration, something odd about the way her computer was set up. The character appeared fine on my computer, and on various other computers I looked at, but not on hers. Why not? Down the rabbit hole I went.</div><p style="clear: both; text-align: left;">By this time, we had graduated to bundling the Junicode font with our developing site, so that readers would not have to install the font on their computer. This is a well-documented process, and Font Squirrel <a href="https://www.fontsquirrel.com/tools/webfont-generator">documents it and provides neat tools</a> to convert any font to a "webfont", easily embeddable in any web page. So I began investigating. 
On my computer, the character appeared fine:</p><p style="clear: both; text-align: left;"></p><ul style="text-align: left;"><li>if I had Junicode on my computer, and the font embedded in the page</li><li>if I had Junicode on my computer, and the font NOT embedded in the page</li></ul><div>It did NOT appear fine if I did NOT have Junicode on my computer and had only Junicode embedded in the web page. Yet the web page showed Junicode everywhere else -- but not this character, and a few other characters. How could this be? </div><p style="text-align: left;">I began digging. The unicode code point for p with a bar is A751. This is in the "general use" area of unicode, which major fonts will support as a matter of course: so you can paste the ꝑ from this document into a Word document and use it in Times New Roman, Geneva, etc. When I looked at Junicode in my computer, using Apple's Font Book, p with a bar appeared as glyph 2007, Unicode A751, exactly as it should:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCbXIfg7agQTd3atuZGdE4kXobXh94RRrQSKG2HjRQ1Oq5Od0lxgFtc3Et_FsrkpqqRDImzLsouvyPpJscN6gbK_zNIrZMyJvdFqrKglKkkvcEyjrlFWhdJ96kBS54N7ihCUmb2NldGQ/s518/fontbook.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="172" data-original-width="518" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCbXIfg7agQTd3atuZGdE4kXobXh94RRrQSKG2HjRQ1Oq5Od0lxgFtc3Et_FsrkpqqRDImzLsouvyPpJscN6gbK_zNIrZMyJvdFqrKglKkkvcEyjrlFWhdJ96kBS54N7ihCUmb2NldGQ/s320/fontbook.png" width="320" /></a></div><p style="clear: both; text-align: left;">However, on my collaborator's computer, the same character appeared in a quite different place: as glyph 2066, unicode E670 (on my computer, Junicode has a quite different character at glyph 2007). </p><p style="clear: both; text-align: left;">What is going on? Why is her Junicode different from mine? 
On digging about, it appears that some time in the past, Junicode indeed had this character at E670. The "E" and "F" unicode ranges are "Private Use" areas, and it appears that up to the time when p with a bar was allocated A751 in the "general use" area, Junicode put p with a bar in the "private use" area, with that encoding. This is a rather long story, involving a group called the <a href="https://en.wikipedia.org/wiki/Medieval_Unicode_Font_Initiative">Medieval Unicode Font Initiative</a> (MUFI). One of the aims of this group was to have "core" characters judged as essential to scholars working with medieval western European texts incorporated into the "official" Unicode encoding. As of Unicode 5.1, 152 MUFI characters -- among them, p with a bar -- had <a href="https://en.wikipedia.org/wiki/Private_Use_Areas">made it</a> into official unicode. It appears that my version of Junicode reflects this shift of p with a bar into official, post 5.1, unicode. The version of Junicode on Prue's computer did not.</p><p style="clear: both; text-align: left;">More digging. By this time, I was suspecting that the embeddable version of Junicode did not have p with a bar at A751. But why did it display correctly on my computer? It appears that somewhere deep in the innards was an instruction to the effect: if the browser could not find the character in the embedded font, look elsewhere: so it looked in the Junicode on my computer, found it and displayed it. It did this even when I tried to fool it by calling the embedded font something else in the CSS ("junicoderegular") style sheet. However, on my collaborator's computer the character did not appear as A751, and so it showed an A751 from another font altogether.</p><p style="clear: both; text-align: left;">Eventually, after scores of emails and hours of digging, I concluded that the root of the problem lay in the embedded font. Somehow, this embedded Junicode did not have p bar where it should be. 
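The two code points at issue are easy to verify in any modern JavaScript console; this is merely a check of the numbers named above, not project code:

```javascript
// U+A751 is the official ("general use") code point for p with a bar
// through the descender; U+E670 was its old MUFI home in the BMP
// Private Use Area (U+E000 to U+F8FF), where no two fonts need agree
// on what a code point means.
const official = '\u{A751}';                      // ꝑ
const hex = official.codePointAt(0).toString(16); // "a751"
const oldMufi = 0xE670;
const inPUA = oldMufi >= 0xE000 && oldMufi <= 0xF8FF; // true
```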
So I set to trying to correct this. First I went to the Squirrel font generator:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGuzSM_MhSIZPx0plzo1Ticv-aIqiJg8of0fX3m26_ZpYHoUs5_HKFdkoB8EZdr5EnkDzR4dMErCFpZSljI10SwVksPX5CvbVz5C6BbOvIwscBRtgv2cnzzcSEiVhcr4r5QxnfWloAMg/s1520/font+squirrel.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1152" data-original-width="1520" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGuzSM_MhSIZPx0plzo1Ticv-aIqiJg8of0fX3m26_ZpYHoUs5_HKFdkoB8EZdr5EnkDzR4dMErCFpZSljI10SwVksPX5CvbVz5C6BbOvIwscBRtgv2cnzzcSEiVhcr4r5QxnfWloAMg/s320/font+squirrel.png" width="320" /></a></div><p style="clear: both; text-align: left;">I uploaded the Junicode TTF from my computer, Squirrel converted it to a "webfont", and all seemed fine. Nope. Same problem. I dug deeper. I went to <a href="https://www.fontsquirrel.com/fonts/junicode">Peter Baker's "Junicode" page</a> on FontSquirrel and used the "webfont kit" generator on that page. Nope. Same problem. With increasing desperation, I noticed that the page offered a choice of "subsets":</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja4lfB604e2Z8LALnzQantBndpy_2lBEtCzOkEdjMHQkpaRJWtS_cNjJjnOvIVcMJQH4LBnOvnK8Zi-Pp0c2cZTdBxtHeijdo269lpNrjbwHxI6qWO_OSI8cD43S5NsDM-Q46VA5nPig/s1458/subsets.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="180" data-original-width="1458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja4lfB604e2Z8LALnzQantBndpy_2lBEtCzOkEdjMHQkpaRJWtS_cNjJjnOvIVcMJQH4LBnOvnK8Zi-Pp0c2cZTdBxtHeijdo269lpNrjbwHxI6qWO_OSI8cD43S5NsDM-Q46VA5nPig/s320/subsets.png" width="320" /></a></div><p style="clear: both; text-align: left;">So, I chose "no subsetting" and created the webfont. And at last! 
it worked!</p><p style="clear: both; text-align: left;">All this for characters which appear just five times in some 2400 pages of manuscript transcription. </p><p style="clear: both; text-align: left;">This tale casts into relief the many rough edges that exist in the interplay of fonts, glyphs, character coding points, unicode spaces, and encoding systems (utf8? or 16? BOM or not?), all playing against multiple versions as all of these evolve and agreements are forged and renewed. The wonder is that problems like these occur so rarely.</p><p style="clear: both; text-align: left;"><br /></p><br /><p style="text-align: left;"><br /></p><p></p><div class="separator" style="clear: both; text-align: left;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><p></p>PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-26574002352519410582018-02-13T05:03:00.000-08:002018-02-16T11:49:04.674-08:00Getting Started with Textual CommunitiesWelcome to the temporary home of Version 2 of Textual Communities ("TC"), at <a href="http://textcomtest.usask.ca/">textcomtest.usask.ca</a>. This address will change when we are ready to go fully public with TC. Until then, this is a sandbox version, and all data may disappear at any point.<br />
If you just want to see what TC can do: choose a community from "Public Communities", and press "View".<br />
<h3>
Sample files</h3>
<div>
You can get the sample files used in this documentation at <a href="http://www.sd-editions.com/tc">www.sd-editions.com/tc</a>. You can download all the files in this directory in a single zipfile at <a href="http://www.sd-editions.com/tc/tcstart.zip">www.sd-editions.com/tc/tcstart.zip</a>. </div>
<h3>
Logging in</h3>
<div>
Here is what you see:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAyyay_NtaafzGKLUbatnKdZk8-WDcS7Bfino5qiEm2Ut1ZDHcoITCm_6TUo_GSyEOca6OKBTRZbGj2UKiXcXtqZPkk-meaAtPvSTlZIbHkZZgg_HdONnOc1WmVL251X7qzsSJkqGi2w/s1600/start.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="712" data-original-width="1600" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAyyay_NtaafzGKLUbatnKdZk8-WDcS7Bfino5qiEm2Ut1ZDHcoITCm_6TUo_GSyEOca6OKBTRZbGj2UKiXcXtqZPkk-meaAtPvSTlZIbHkZZgg_HdONnOc1WmVL251X7qzsSJkqGi2w/s400/start.png" width="400" /></a></div>
<div>
<br /></div>
<div>
Press the inviting "Start" button, and you will be asked to log in by social media, or create a log-in using your email address. If you do the latter, you will be sent an email to that address to confirm your registration. (Note: TC uses email addresses to uniquely identify each user).</div>
<div>
<h3>
Creating or joining a community</h3>
</div>
<div>
When you first log in as a new user, the Start button has changed:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidVlGm5VY-RvVSNkjbhW8zWU7zZMooMw8lILJtgPCKDAK6OdH1u-1VHS1K_g-hw2wSLVGwois4GLOGtcRMUf0B_jGjvu0_6S5e7N5KgKQjSz3WgIGNlZ2pIDPHt5uSUHtY1nw1QIpDOA/s1600/createorjoin.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="326" data-original-width="1600" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidVlGm5VY-RvVSNkjbhW8zWU7zZMooMw8lILJtgPCKDAK6OdH1u-1VHS1K_g-hw2wSLVGwois4GLOGtcRMUf0B_jGjvu0_6S5e7N5KgKQjSz3WgIGNlZ2pIDPHt5uSUHtY1nw1QIpDOA/s400/createorjoin.png" width="400" /></a></div>
<div>
<br /></div>
<div>
The "Create Community" button brings you to this screen:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2hhrbO1SJNnL1869MU5JHuGF6-OrfgpU47taBNqOEcA_0f8nm0ZHjXw_51mLrIJXB_wa0yVHUUWm6HWRum4oW27E_EZH36OLOJx2DrPktmHxmaYFOctwIWouIg7cZ5Q3nf3zWABeRZw/s1600/createcomm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1128" data-original-width="1600" height="281" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2hhrbO1SJNnL1869MU5JHuGF6-OrfgpU47taBNqOEcA_0f8nm0ZHjXw_51mLrIJXB_wa0yVHUUWm6HWRum4oW27E_EZH36OLOJx2DrPktmHxmaYFOctwIWouIg7cZ5Q3nf3zWABeRZw/s400/createcomm.png" width="400" /></a></div>
<div>
<br /></div>
<div>
The two compulsory fields, "Name" and "Abbreviation", are marked with *. Note the accessibility options: you can hide your community from everyone, or allow anyone to do anything, and many options in between.<br />
<br /></div>
<h3>
Your first document: an XML file</h3>
<div>
Once you have a community, you need documents! The "Start" button at the centre of the screen has changed again:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJabXCDOAl0bDVmJdoO8MYAUq2bZepZze_ZQ9DuIwdAFo_S-DYZtfXY9aQAwuLHMqR46caLTeZuVqu4LZwjl5UPq1WSDMgFYg4nxhg_GMOAECgbUpheSztH47Eu_8AqL_zNdtPqW3kJQ/s1600/firstdocument.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="268" data-original-width="1474" height="72" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJabXCDOAl0bDVmJdoO8MYAUq2bZepZze_ZQ9DuIwdAFo_S-DYZtfXY9aQAwuLHMqR46caLTeZuVqu4LZwjl5UPq1WSDMgFYg4nxhg_GMOAECgbUpheSztH47Eu_8AqL_zNdtPqW3kJQ/s400/firstdocument.png" width="400" /></a></div>
<div>
<br /></div>
<div>
Choose "Add Document" and you are offered two choices:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRqr6rZ6b40F58yR21pHYLJ6c3Olmo0HV09AnUUSxWEfaKMhK58tMxYwJZEU-L5CkJaG5ZgFriFfOiHkIkNb-H4ay7lPo11s0lXmCVDxqyJZ_aoXcjxzpffzVGubAljT5l2MOO8DV7Yg/s1600/adddoc.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="388" data-original-width="1084" height="114" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRqr6rZ6b40F58yR21pHYLJ6c3Olmo0HV09AnUUSxWEfaKMhK58tMxYwJZEU-L5CkJaG5ZgFriFfOiHkIkNb-H4ay7lPo11s0lXmCVDxqyJZ_aoXcjxzpffzVGubAljT5l2MOO8DV7Yg/s320/adddoc.png" width="320" /></a></div>
<div>
<br /></div>
<div>
This time, select the "XML file" option. TC likes TEI! Here is a very simple example of a TEI/XML file, optimized for TC use:</div>
<div>
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;"><?xml version="1.0" ?> </span><br />
<span style="font-size: x-small;"><TEI xmlns="http://www.tei-c.org/ns/1.0"></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><teiHeader></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><fileDesc></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><titleStmt><title>Fairfax</title></titleStmt></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><publicationStmt></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><p>Draft for Textual Communities site (spelling modernized)</p></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span></publicationStmt></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><sourceDesc><p>Murray McGillivray</p></sourceDesc></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span></fileDesc></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span></teiHeader></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><text></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span> <body></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span> <pb n="130r" facs="FF130R.JPG"/></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><div n="Book of the Duchess"></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"><span style="white-space: pre;"> </span> </span><lb/><head n="Title">The book of the Duchesse</head></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="1">I Have great wonder/ be this light</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="2">How that I live/ for day nor night</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="3">I may nat slepe/ wel nigh nought</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="4">I have so many/ an idel thought</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="5">Purely/ for default of sleep</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="6">That by my truthe/ I take no keep</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="7">Of no thing/ how it cometh or goth</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="8">Ne me is no thing/ leief nor loth</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="9">Al is y like good / to me</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span><span style="white-space: pre;"> </span><lb/><l n="10">Joy or sorrow / where so it be</l></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span> </div></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span> </body></span><br />
<span style="font-size: x-small;"><span style="white-space: pre;"> </span> </text></span><br />
<span style="font-size: x-small;"></TEI></span><br />
<span style="font-size: x-small;"><br /></span>
There are a few things to note about this file:<br />
<ul>
<li>"Content" elements with "n" attributes (<l n="1">) are especially important to TC. TC uses these to identify all content sections. Thus: the first line is labelled by TC as "div=Book of the Duchess:l=1", and TC then uses this identifier to locate all versions of the first line in every document</li>
<li>Note the explicit use of <lb/> elements to mark each new line in the document. TC uses the implicit hierarchy of page, column and line breaks (<pb/> <cb/> <lb/>) to construct a "text-tree" for each document, alongside the "text-tree" it creates for the hierarchy of <div> and <l> elements.</li>
</ul>
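The way TC derives those identifiers can be sketched in a few lines of Python. This is an illustration of the idea only, not TC's actual code, and it assumes the simple div/l structure of the sample file above:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# A stripped-down version of the sample file above.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div n="Book of the Duchess">
    <l n="1">I Have great wonder/ be this light</l>
    <l n="2">How that I live/ for day nor night</l>
  </div>
</body></text></TEI>"""

root = ET.fromstring(sample)
identifiers = {}
for div in root.iter(TEI + "div"):
    for l in div.iter(TEI + "l"):
        # "div=...:l=..." mirrors the identifier format described above.
        key = f"div={div.get('n')}:l={l.get('n')}"
        identifiers[key] = l.text

print(identifiers["div=Book of the Duchess:l=1"])
# I Have great wonder/ be this light
```

Because every document in the community carries the same "n" attributes, the same key locates the corresponding line in every witness — which is what makes collation possible.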
TC's understanding, that every text is composed of two distinct text-trees, one for the document (<pb/> <lb/> etc) and one for the act of communication represented in the document (<div>, <l> etc), is what separates TC from other systems for creating scholarly editions.<br />
<br />
<h3>
Adding more documents, adding images</h3>
</div>
<div>
After selecting "XML file" you will get this dialogue:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3mQE8RcZ53xZ3sQig6FXIpmiZnFK9diZqMIjzNu4RRRRNGLgdJaWXrAoMWB_bUy8BnN8tVizKOxWhP-mEWtMLRNdO2psV1VBBiBvJTTAPuQVuG0rFUQvDfQl1ce_PHIF8EBhbUtapTQ/s1600/loadxml.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="375" data-original-width="856" height="139" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3mQE8RcZ53xZ3sQig6FXIpmiZnFK9diZqMIjzNu4RRRRNGLgdJaWXrAoMWB_bUy8BnN8tVizKOxWhP-mEWtMLRNdO2psV1VBBiBvJTTAPuQVuG0rFUQvDfQl1ce_PHIF8EBhbUtapTQ/s320/loadxml.png" width="320" /></a></div>
<div style="clear: both; text-align: center;">
<br /></div>
<div>
Choose the file "Fairfax.xml" from the sample files (see above), give it the name "Ff" (or similar), and press "Load".</div>
<div>
You will receive various encouraging messages, and the window should change to show you the sigil for this manuscript in the left hand pane:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHgRMvfOBtKeT5uFx6PHEpVap_O-MkPa0Z-Lrdfwui7I77FUKiv2Zd8wc6RR1CZBUABMDb8PruNEYDZ1DgGE82WInOfmYExKsUgL4uTxzxgFyZLjHTRrrtYXudX2U4OON0h9BTd6aFCg/s1600/leftpaneinit.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="262" data-original-width="704" height="119" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHgRMvfOBtKeT5uFx6PHEpVap_O-MkPa0Z-Lrdfwui7I77FUKiv2Zd8wc6RR1CZBUABMDb8PruNEYDZ1DgGE82WInOfmYExKsUgL4uTxzxgFyZLjHTRrrtYXudX2U4OON0h9BTd6aFCg/s320/leftpaneinit.png" width="320" /></a></div>
<div>
Click on the arrow beside Ff to see the pages in Ff, and then click on the first page. Its transcription will now appear in the bottom right pane:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMVT9qAZTbdHoOyx_4lFoVQkH8XO-aqj84kNFCILDpK4wTVPzmfqumMKW_uEILUW61vGM549bk-5KaO8E4vFkwTmq1J0kBZoPYhfW0xsRvJVbxWYlCAy6lmIO96nYXWxUSDviuPZeeoA/s1600/transcriptView.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="819" data-original-width="1600" height="326" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMVT9qAZTbdHoOyx_4lFoVQkH8XO-aqj84kNFCILDpK4wTVPzmfqumMKW_uEILUW61vGM549bk-5KaO8E4vFkwTmq1J0kBZoPYhfW0xsRvJVbxWYlCAy6lmIO96nYXWxUSDviuPZeeoA/s640/transcriptView.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
Now, you can add an image to the page. You can do this in several ways:</div>
<div>
<ul>
<li>Click on the "Add Image" button in the top-right pane, or the camera icon beside the page number "130r". You will get a box inviting you to choose an image file or drop it onto the dialogue. Choose FF130R.JPG from the sample files.</li>
<li>You can load multiple images by putting them all in a folder, zipping the folder, and then clicking on the ZIP icon next to the manuscript name. Choose FairfaxImages.zip from the sample files.</li>
</ul>
In either case, you will see the image appear in the top right pane. The red camera icon beside each page which now has an image will turn black. If you have all the images for the manuscript, the multiple image icon (two cameras above one another) will also turn black:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilWjX6-gtAXwha4GPM3y48qLLCc0-j8PBT6SpSOwGjMa2foiNw11KDpDKm9VaWAi9qHCPOAakEGzqOoZd8PqWMxGdEzgVAJYl73iZjNXviLI5EIFpqUlPxsCbnnaguFieF8Oh5u25kYQ/s1600/transcriptViewImage.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="389" data-original-width="1600" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilWjX6-gtAXwha4GPM3y48qLLCc0-j8PBT6SpSOwGjMa2foiNw11KDpDKm9VaWAi9qHCPOAakEGzqOoZd8PqWMxGdEzgVAJYl73iZjNXviLI5EIFpqUlPxsCbnnaguFieF8Oh5u25kYQ/s640/transcriptViewImage.png" width="640" /></a></div>
<div>
<br /></div>
<div>
Play around with the other icons on this page. Try pressing the "Save", "Preview", and "Commit" buttons, to see what happens. (Note: "Commit" will write the page to the underlying database.)<br />
Add another document by clicking on the <span style="background-color: blue; color: white; font-size: x-large;">+</span> icon in the left hand pane. Again, choose the "XML file" option; this time add "Bodley.xml" from the sample files, with the name Bd.<br />
<br />
<h3>
Collation</h3>
</div>
<div>
The power of Textual Communities may be seen in the Collation system. At the top of the left panel, click the "Collation" tab:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5keAgEsJ0D8C27ifPBVLLXZ1YTt0lqiBv7nr_adZP2gJuhD8xdZNE_d6HaQceczNdmPhNJjUWVQHq2Sx0FOLkkQMWTLVfrIb1lS227LL8fd5QaMvy6-rNAGr5WRIYl1N-hexddbTlJA/s1600/collatehead.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="262" data-original-width="1580" height="66" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5keAgEsJ0D8C27ifPBVLLXZ1YTt0lqiBv7nr_adZP2gJuhD8xdZNE_d6HaQceczNdmPhNJjUWVQHq2Sx0FOLkkQMWTLVfrIb1lS227LL8fd5QaMvy6-rNAGr5WRIYl1N-hexddbTlJA/s400/collatehead.png" width="400" /></a></div>
<div>
<br /></div>
<div>
In TC terms, an "entity" is a discrete segment of an act of communication: a line of poetry, a paragraph of prose. Click on the arrow beside "Book of the Duchess" to open up the entities (lines of poetry) within it:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6TLfZy55vnBuoXUGDuqdqPmA4cB5lhDCWB-gGfXSyMfJynmwb9pwiQ4P12v-0WX-274HcQO80JskSMFx5_iteLtVzQOWffoL7AvKp72oU4MaU6pFoj3FahHggJ1h15F15mOEIP6M7rA/s1600/collsubents.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="302" data-original-width="444" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6TLfZy55vnBuoXUGDuqdqPmA4cB5lhDCWB-gGfXSyMfJynmwb9pwiQ4P12v-0WX-274HcQO80JskSMFx5_iteLtVzQOWffoL7AvKp72oU4MaU6pFoj3FahHggJ1h15F15mOEIP6M7rA/s200/collsubents.png" width="200" /></a></div>
(The order of these may vary.) Now, click on one of these lines. You will get this advice:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE1va1rqQvv52GQPI0yjlH3v9bp2jH1qzuJC8GQDITlcGIFHKDW9CuTmz9bA2W2DzrOqXUHhseNwksLKkFkpz1NrP6BocppLq6pRujBKpjmVMBnkqy_C5x3CbBcAx4qVAzDs0dIaIlkg/s1600/choosecollbase.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="370" data-original-width="918" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE1va1rqQvv52GQPI0yjlH3v9bp2jH1qzuJC8GQDITlcGIFHKDW9CuTmz9bA2W2DzrOqXUHhseNwksLKkFkpz1NrP6BocppLq6pRujBKpjmVMBnkqy_C5x3CbBcAx4qVAzDs0dIaIlkg/s320/choosecollbase.png" width="320" /></a></div>
So, go to that menu:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvjwwCHcdDdZqIId64wob0XvWjRPrjbA08u7Ma30HuO6CMcrdCHdxNLFLR5PPd39BPezV0piP7ttfH1i266s0jpfFeACGr6_paQEnDsxxN6M9e2PID-M0CkLJsz9p-NPMzzvdBIDlhcg/s1600/choosecollbasebox.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="236" data-original-width="696" height="67" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvjwwCHcdDdZqIId64wob0XvWjRPrjbA08u7Ma30HuO6CMcrdCHdxNLFLR5PPd39BPezV0piP7ttfH1i266s0jpfFeACGr6_paQEnDsxxN6M9e2PID-M0CkLJsz9p-NPMzzvdBIDlhcg/s200/choosecollbasebox.png" width="200" /></a></div>
Choose a base text (it does not matter which). Now, go back to click on line 1 in the collation. The right hand panel will change, to present the wonderful Collation Editor (developed originally for the Greek New Testament editing projects at Münster and Birmingham):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3Vbn-3YYp0NLl421NdxlDTrvoeg6yScopS_2miRdaef7vcdMEE1bodXHIn0vBP8HlH0VOa4yAmyQutX-W40uHo0iZ8Izj2f9IsXpkKv7nfVytK77sJNg0yCz3crY_BzkN9ADzeP6okA/s1600/collation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1286" data-original-width="1148" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3Vbn-3YYp0NLl421NdxlDTrvoeg6yScopS_2miRdaef7vcdMEE1bodXHIn0vBP8HlH0VOa4yAmyQutX-W40uHo0iZ8Izj2f9IsXpkKv7nfVytK77sJNg0yCz3crY_BzkN9ADzeP6okA/s400/collation.png" width="356" /></a></div>
(You may need to make the window larger to see the menu at the bottom of the pane). Spend some time playing with this. You can regularize variants (e.g. remove the variant wonder/wondir) by dropping one word on another:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinB7nzqgodzsIpIRc1P97TQ2QcrWB52UNJ_JEOkU8PqGlGlV8H92TmMc5X5BTk4-6220K1sSovu4Lob2w43_xCBcXM_-EwSjEqeom6us8ob8A1HbNagKSb1W_YmtCdjvgKfp7viMmHiw/s1600/regularize1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="690" data-original-width="730" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinB7nzqgodzsIpIRc1P97TQ2QcrWB52UNJ_JEOkU8PqGlGlV8H92TmMc5X5BTk4-6220K1sSovu4Lob2w43_xCBcXM_-EwSjEqeom6us8ob8A1HbNagKSb1W_YmtCdjvgKfp7viMmHiw/s200/regularize1.png" width="200" /></a></div>
After choosing "Save", you will see that both manuscripts now have the reading "wonder":<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyIQvu6cqU5QPuZMHnjGe2XVnGGFTaIB5z1wkJ7xu2ABRLWC7q71T1e1l4PWtBMz7pduJeXlCRWf6spG_QGYpHUCIux4h4KmfNnwKg8RUw4k0Iybn1JCSTpKS3CJ733J0NL38bZUIWiA/s1600/regularization.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="310" data-original-width="658" height="93" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyIQvu6cqU5QPuZMHnjGe2XVnGGFTaIB5z1wkJ7xu2ABRLWC7q71T1e1l4PWtBMz7pduJeXlCRWf6spG_QGYpHUCIux4h4KmfNnwKg8RUw4k0Iybn1JCSTpKS3CJ733J0NL38bZUIWiA/s200/regularization.png" width="200" /></a></div>
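Under the hood, a regularization of this kind amounts to a mapping applied to witness tokens before they are compared. A toy sketch of the idea (illustrative only; TC's real regularization rules are stored per project and applied inside the Collation Editor):

```python
# Each rule maps a witness spelling to its regularized form.
# Dropping "wondir" onto "wonder" in the interface creates, in effect,
# an entry like this one.
rules = {"wondir": "wonder"}

def regularize(tokens):
    """Replace each token by its regularized form, if a rule exists."""
    return [rules.get(t, t) for t in tokens]

ff = "I have great wonder be this light".lower().split()
bd = "I have great wondir be this light".lower().split()

# After regularization the two witnesses no longer differ at this point,
# so the collation shows a single shared reading:
print(regularize(ff) == regularize(bd))  # True
```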
Play with the settings menu. You can change how the collation works from this menu:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitYRY4ffr4EeGPb5rxwxVy5zm-VDSo4uBKf0YoAu0uos-AsuuX9R33gsmdFyROQZXPcT2Gc2igINkeFLMKXHgtkpqO8pYJwdrRZ0S_w0zmJD10sM56RNrO2B0HBTdTLHISUWAALWfhig/s1600/regsettings.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="520" data-original-width="770" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitYRY4ffr4EeGPb5rxwxVy5zm-VDSo4uBKf0YoAu0uos-AsuuX9R33gsmdFyROQZXPcT2Gc2igINkeFLMKXHgtkpqO8pYJwdrRZ0S_w0zmJD10sM56RNrO2B0HBTdTLHISUWAALWfhig/s320/regsettings.png" width="320" /></a></div>
You will see how the collation changes as these selections change.<br />
This brief introduction gives only a glimpse of the power of the Collation Editor. Try the following, for example:<br />
<br />
<ol>
<li>Go back to one of the documents, change line 1, commit the change (this writes it to the database used by the collation), and return to the collation. You will see your change there.</li>
<li>Now, for fun: go to the second page of Ff (130v), make line 38 continue from the previous page onto this page, and add something to it. Hint: change the "From previous page" value:</li>
</ol>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuEKzkyTNqBig-_wdDZNxefYpkH6srRKeFpqA7-wJb6mAr0AmB74oXdHIP6Fy1AzKkGMBOn5hvtiD4wJoa5quBNo73vAjgqpND_e0jK8oeQ7QEPcKgfkQKJT2AsLGvvQ8KSiDHo-nlOQ/s1600/FFcontinue+line.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="232" data-original-width="1600" height="91" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuEKzkyTNqBig-_wdDZNxefYpkH6srRKeFpqA7-wJb6mAr0AmB74oXdHIP6Fy1AzKkGMBOn5hvtiD4wJoa5quBNo73vAjgqpND_e0jK8oeQ7QEPcKgfkQKJT2AsLGvvQ8KSiDHo-nlOQ/s640/FFcontinue+line.png" width="640" /></a></div>
<div>
<br /></div>
Then, commit this change and return to the collation. You will see that line 38 now includes this extra text, across the page break. You can view the XML for this page by clicking on the XML icon beside the manuscript name, to confirm that the line indeed continues across the page break:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEMzLCPUWh_v1XvBLDAHIKuKEtdpDYEYR7WFcEqgY1BU2QwdR5pxhMpoq0D8_NLVjlLcF2ywDQDK5tR4__hbPpeUXGqMoUbOZNW9gNSJWQkBpffubwuA4I9jRLm40d6eJ2YDCWMgvMGQ/s1600/xmlff.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="182" data-original-width="758" height="76" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEMzLCPUWh_v1XvBLDAHIKuKEtdpDYEYR7WFcEqgY1BU2QwdR5pxhMpoq0D8_NLVjlLcF2ywDQDK5tR4__hbPpeUXGqMoUbOZNW9gNSJWQkBpffubwuA4I9jRLm40d6eJ2YDCWMgvMGQ/s320/xmlff.png" width="320" /></a></div>
<div>
<br /></div>
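In the XML, "continuing across the page break" simply means that the <pb/> element falls inside the <l> element for line 38, so the line's full text is everything on either side of the break. A small sketch of how that reads out (the wording of the line here is invented for illustration, not taken from the sample files):

```python
import xml.etree.ElementTree as ET

# A line of the poem (entity tree) spanning a page boundary (document tree).
snippet = """<l xmlns="http://www.tei-c.org/ns/1.0" n="38">and sorrowful imagination
  <pb n="130v" facs="FF130V.JPG"/><lb/>is always wholly in my mind</l>"""

line = ET.fromstring(snippet)
# itertext() gathers the text leaves in document order, passing straight
# over the page and line breaks; split/join normalizes the whitespace.
text = " ".join("".join(line.itertext()).split())
print(text)
# and sorrowful imagination is always wholly in my mind
```

This is the two-tree model in miniature: the page break belongs to the document tree, the line to the entity tree, and the collation works on the entity.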
<div>
<h3>
Other facilities</h3>
<div>
There is a great deal more in TC than this sketch shows. It is particularly rich in community management features, as follows:</div>
<div>
<ol>
<li>You can invite other people to become members of your community (click on the "Members" link when you have chosen your community, or on the "Member profile" item on the log-in menu) and follow the "Invite" link</li>
<li>You can change the status of any member, assign them pages to transcribe, check the progress of the transcription, assign them someone to approve their transcripts (the "Members" link for each community you lead)</li>
<li>You can permit other people to join your community without need of your approval, or require that anyone who wants to join must be approved by you ("Member profile" on the log-in menu)</li>
</ol>
Further, you can permit anyone to access pages, whole documents, or any part of the text of any document, and import it to their own website.<br />
<br /></div>
<h3>
Copyright, etc.</h3>
<div>
We encourage anyone contributing materials to TC to make these available under the Creative Commons Attribution (CC-A) license. That is: no share-alike and no "non-commercial" restrictions. This means there should be no restrictions at all, except requiring all subsequent users of the material to acknowledge your part in making it.</div>
<div>
For the time being: TC will accept materials which do have restrictions on them. However, it is likely that TC in future will require that all materials held on TC servers are free of all restrictions (CC-A or similar). This is because TC uses University of Saskatchewan and Compute Canada servers. As both are publicly funded, hosting materials with any kind of copyright restrictions raises legal and ethical issues.</div>
<div>
If this is a problem for you, you should not use TC.</div>
<div>
<br /></div>
<div>
<h3>
Some interesting features of TC</h3>
</div>
<div>
Here, in no particular order, are some aspects of TC which make it unusual, even unique:</div>
<div>
<ul>
<li>TC is built on an explicit ontology of texts, documents and works. Several of my publications describe this ontology (see <a href="https://www.academia.edu/12297061/Some_principles_for_the_making_of_collaborative_scholarly_editions_in_digital_form">https://www.academia.edu/12297061/Some_principles_for_the_making_of_collaborative_scholarly_editions_in_digital_form</a>; <a href="https://www.academia.edu/9575974/The_Concept_of_the_Work_in_the_Digital_Age_published_version_">https://www.academia.edu/9575974/The_Concept_of_the_Work_in_the_Digital_Age_published_version_</a>; <a href="https://www.academia.edu/3233227/Towards_a_Theory_of_Digital_Editions">https://www.academia.edu/3233227/Towards_a_Theory_of_Digital_Editions</a>). Briefly: TC sees text as a collection of leaves, with all leaves present on two distinct trees, each of which conforms precisely to the "OHCO" (ordered hierarchy of content objects) model. One of the trees represents the document (codex/quires/pages/columns/lines). The other tree represents the act of communication ("entity") inscribed in the document: as Play/Acts/Scenes/Lines, or Poem/Stanzas/Lines, etc. Note that this is not simply a matter of "overlapping hierarchies", as usually characterized. It is actually two quite distinct trees: distinct to the point that branches and their leaves might appear in quite different orders on the two trees (as in the case of notes or alterations spanning the margins of multiple pages, etc.). Broadly, TC uses the 'document' tree to display the document page by page, line by line, and the 'entity' tree to locate units of text across multiple documents for collation.</li>
<li>XML and all the tools associated with it famously support "one text, one tree". (Long ago, XML's predecessor SGML did attempt to enable multiple trees in any one text through the CONCUR feature. I never did discover a useful implementation of CONCUR.) Over some twenty-five years, I have tried to manipulate the two hierarchies using a variety of tools (most prominently, the Anastasia publishing system). One problem was that for a long time I thought the problem was simply "overlapping hierarchies", and not the more demanding scenario of two distinct trees. Another problem was the inefficiency of XML tools. Accordingly, while TC uses XML as its standard input format, it creates the two distinct trees from the XML and then stores the two trees not as XML but as a series of JSON documents stored in a MongoDB backend. In essence, the text is a collection of leaves stored in JSON fields, with each leaf also stored in distinct JSON documents representing the two trees. Over the last decade I have attempted to express this model with three different database systems: first, XML in the form of XML-DB; then SQL in a relational database (underlying the first version of TC, still to be seen at www.textualcommunities.usask.ca), and finally JSON. JSON wins. A key reason for the success of JSON was the requirement that we be able to edit pages in real time: that is, take out a chunk of each tree, rebuild both trees as needed and then reattach the leaves of text to each rebuilt tree, all while the editor watches. Doing this in real time is like gathering leaves in a howling gale. As a bonus, JSON (much more than XML) is the native language of web content, with an immense range of JavaScript/HTML tools available to process it.</li>
<li>Technically: TC is built in pure JavaScript, using node.js and npm tools (<a href="https://nodejs.org/en/">https://nodejs.org/en/</a>; <a href="https://www.npmjs.com/">https://www.npmjs.com/</a>), for both server and browser components. This makes maintenance far easier. TC also uses the Angular framework to provide all interface components (<a href="https://angularjs.org/">https://angularjs.org/</a>; drawing on the Bootstrap and jQuery libraries). This architecture was designed by Xiaohan Zhang between 2012 (when we realized that the SQL solution would not work) and 2015. All code is freely available on GitHub, at <a href="https://github.com/DigitalResearchCentre/tc">https://github.com/DigitalResearchCentre/tc</a>.</li>
<li>Theoretically: there is no limit to the number of trees structuring every text. TC supports two. Best of British luck to whoever wants to deal with more than two.</li>
<li>TC uses IIIF server and viewer software (<a href="http://iiif.io/">http://iiif.io/</a>). In the future, we want to broaden our support for IIIF, to import full IIIF documents, etc.</li>
<li>We would like to be obsolete very very soon. Someone please do this better than we did.</li>
</ul>
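The two-tree model described in the first points above can be illustrated in miniature. This is a conceptual sketch only: the field names, the MongoDB-style documents and the sample words are invented for illustration, not TC's actual schema.

```python
# Sketch of the two-tree model: every leaf of text belongs both to a
# document tree and to an entity tree, each a strict OHCO hierarchy,
# and the two trees may order the leaves differently.
# All names and values here are illustrative, not TC's actual schema.

leaves = {
    "w1": {"text": "Whan"},
    "w2": {"text": "that"},
    "w3": {"text": "Aprill"},
}

# Document tree: page > line, in page order.
document_tree = {
    "type": "page", "n": "1r",
    "children": [
        {"type": "line", "n": "1", "leaves": ["w1", "w2", "w3"]},
    ],
}

# Entity tree: poem > line, in reading order (which may differ from
# page order, e.g. for a marginal addition written on a later page).
entity_tree = {
    "type": "poem", "n": "General Prologue",
    "children": [
        {"type": "line", "n": "1", "leaves": ["w1", "w2", "w3"]},
    ],
}

def text_of(tree, leaves):
    """Recover the running text by walking one tree's leaf references."""
    out = []
    for child in tree["children"]:
        out.extend(leaves[w]["text"] for w in child["leaves"])
    return " ".join(out)

print(text_of(entity_tree, leaves))  # -> Whan that Aprill
```

Because the leaves are stored once and each tree holds only references to them, either tree can be rebuilt without touching the text itself, which is what makes the real-time page editing described above feasible.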
</div>
<br />
</div>
PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-62375770125290364402014-09-29T06:44:00.002-07:002014-09-29T16:42:13.745-07:00The history of Collate<h2>
Historical note</h2>
<div>
This post was first published as part of the series of blogs detailing the move from Collate 2 to CollateX. I reproduce it here, in the run-up to the Munster Collation Summit on October 3 and 4, 2014, which might formally mark the final, irrevocable, irredeemable death of Collate 0, 1 and 2.</div>
<div>
<br /></div>
<div>
<br />
<div class="MsoNormal">
<i><span lang="IT">February
6, 2007<o:p></o:p></span></i></div>
<div class="MsoNormal">
<b><span lang="IT">The
History of Collate<o:p></o:p></span></b></div>
<div class="MsoNormal">
<b><span lang="IT"><br /></span></b></div>
<div class="MsoNormal">
<span lang="EN-GB">Filed
under: History, Anastasia: finding another — Peter @ 9:29 am <o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB"><br /></span></div>
<strong><span lang="EN-GB">Collate 0 — Collate
1 — Collate 2</span></strong><span lang="EN-GB"><o:p></o:p></span><br />
<strong><span lang="EN-GB"><br /></span></strong>
<span lang="EN-GB">There have actually been
three versions of Collate, up to now. The very first, Collate 0 if you like, I
wrote in Spitbol on the DEC Vax in Oxford between 1986 and 1989. I wrote this
to collate 44 manuscripts of the Old Norse narrative sequence Svipdagsmal,
which I was editing for my doctoral thesis. I prepared full transcripts of each
manuscript on a Macintosh computer, and then transferred them to the Vax
(itself, I remember, not so straightforward a task in those days of the floppy
disc). I collated the transcripts using the Spitbol program, and created
various kinds of output. One of these outputs became the apparatus for the
critical edition included in my thesis. Another output was translated into a
relational database, which I used to explore the relationships between the
manuscripts. To optimize this, information about just what manuscript had what
variant in the database was held in a matrix, with rows representing each
variant, and columns representing each manuscript. Thus: <o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
<span lang="EN-GB">A B C<br />
1 0 1<br />
1 1 0<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
<span lang="EN-GB">showed that manuscripts A
and C agree at the first variant (both having variant ‘1’, while B has ‘0’);
manuscripts A and B agree at the second variant (both having variant ‘1’ while
C has ‘0’). This matrix has a historical importance: this was the data given to
the participants in the ‘textual criticism challenge’ of 1991 which
established, firstly, that phylogenetic methods were far ahead of any other
kinds of analysis in applicability to the analysis of textual traditions, and,
secondly, that phylogenetic analysis could prove genuinely useful in
establishing historical relations within a textual tradition.<o:p></o:p></span><br />
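The matrix just described can be put to work directly. The sketch below encodes the two example variants above and counts pairwise agreements between manuscripts, which is the raw material for the kind of analysis the ‘textual criticism challenge’ drew on; the code is a modern illustration, not the original Spitbol or the relational database used at the time.

```python
# The variant matrix described above: rows are variants, columns are
# manuscripts; 1/0 record which reading each manuscript carries.
matrix = [
    # A  B  C
    [1, 0, 1],   # variant 1: A and C agree against B
    [1, 1, 0],   # variant 2: A and B agree against C
]
manuscripts = ["A", "B", "C"]

def agreements(matrix, names):
    """Count, for each pair of manuscripts, how many variants they share."""
    counts = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            counts[(names[i], names[j])] = sum(
                1 for row in matrix if row[i] == row[j]
            )
    return counts

print(agreements(matrix, manuscripts))
# -> {('A', 'B'): 1, ('A', 'C'): 1, ('B', 'C'): 0}
```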
<span lang="EN-GB">Collate 0 consisted of
around 1200 lines of Spitbol code. Spitbol was (and is: versions of the program
are still maintained) a rather beautiful program, built around pattern-matching
algorithms. It had some very neat string matching and storage facilities (including
a nifty table facility with hash and key tools). You could write functions
within it, but by modern standards, its data models were crude: everything was
a string, and that was that. Oxford was then a stronghold of Spitbol (and
Snobol) programming: Susan Hockey taught a course in Snobol (I think) and I
remember many animated discussions with her and with Lou Burnard about what I
was trying to do.<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
<span lang="EN-GB">Collate 0 established
several approaches to collation which I retained in the later versions of
Collate, and which indeed will (I think) be part of CollateXML:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal"><span lang="EN-GB">Collation should be based on full
transcripts of the manuscripts. </span><span lang="IT">This seems obvious
now; it was less so then</span></li>
<li class="MsoNormal"><span lang="EN-GB">One should collate all the versions at
once, at the same time, rather than (say) running many pair-wise
comparisons and then melding the many comparisons into one<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">The text needed to be divided into
collateable blocks. This required some system of marking the blocks: I
adopted the COCOA system, then used by the Oxford Concordance Program, for
this<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">Other textual features (notably,
abbreviation) needed markup<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">Some kind of regularization facility was
needed to filter out ‘spelling’ from ‘substantive’ variation<o:p></o:p></span></li>
</ol>
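Point 5, the regularization facility, can be sketched as a simple lookup table applied to each witness before collation, so that only substantive variation survives. The spellings in the table are invented examples, not entries from any actual Collate regularization file.

```python
# A minimal regularization table of the kind point 5 describes.
# Spelling variants map to a regularized form; collation then runs on
# the regularized tokens. The entries are invented examples.
regularization = {
    "yonge": "young",
    "sonne": "sun",
}

def regularize(tokens, table):
    """Lower-case each token and replace it by its regularized form, if any."""
    return [table.get(t.lower(), t.lower()) for t in tokens]

a = regularize("the yonge sonne".split(), regularization)
b = regularize("the young sun".split(), regularization)
print(a == b)  # the two witnesses now collate as identical
```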
<span lang="EN-GB">Collate 0 was successful in
two key ways:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal"><span lang="EN-GB">I managed to finish my thesis, and got my
doctorate, despite spending countless hours, often in the dead of the
night, deep in Oxford University Computing Services on 6 Banbury Road (and
briefly in an OUCS annex in South Parks Road) peering at the green symbols
on the darkened Vax terminal, and endlessly tinkering with and re-running
the Spitbol program<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">I wrote two articles for <em>Literary and
Linguistic Computing</em> about this work. On the strength of these, and
with Susan Hockey’s guidance and help, I submitted a grant application to
the Leverhulme Trust to carry on this work.<o:p></o:p></span></li>
</ol>
<span lang="EN-GB">This grant proposal was
successful, and in September 1989 I started work on what became Collate 1. Only
one person had ever used, and probably could ever use, Collate 0: me. Rather a
lot of computer programs, I have since discovered, are only ever used by the
person who wrote them (including, indeed, some made with much public money).
Our proposal to the Leverhulme trust specified that our collation tool could be
used by many other people. This meant a real graphical user interface, not the
command-line tool which Collate 0 was. Indeed, one needed a graphic interface
because I was by then convinced (and I still believe) that scholarly collation
is an interactive activity. I found that in Collate 0 I spent endless hours
manipulating the collation output by tinkering with the program itself, and by
compiling complex regularization tables to smooth out idiosyncratic spellings
from the tables. This was extremely clumsy. I determined that in Collate 1, we
would have the computer make a first guess at the collation for any part of the
text, a block at a time. The scholar would examine that collation, and then
intervene in a point-and-click way to adjust the collation as needed. For
medieval texts, some form of spelling regularization was required. In Collate 0
the regularizations were held in separate files, which were loaded at runtime:
so you had to run the collation, look at the results, see what needed to be
changed, open and edit the files (with a VI line editor, no easy thing), then
reload and run the collation again — and so on. In Collate 1, I wanted to point
at what word we wanted regularized to what, and to see the result
instantaneously. Similarly, I now knew that any automatic system was going to
make decisions about precisely what collated with what which a scholar would
find unsatisfactory. Take the collation of ‘a cat’ against ‘cat’. Should we
regard this as replacement of one word by a phrase, or as identity of one word
(‘cat’) in each source and addition of another word (‘a’) in one source? In
Collate 0, such intervention was done in the nastiest possible way: by
hardwiring various gotchas into the collation code itself. In Collate 1, this
should again be done by some kind of user intervention, working in a graphical
user interface.<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
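The ‘a cat’ against ‘cat’ problem just described can be made concrete. Both alignments below are defensible, and a naive scoring function cannot choose between them; this toy illustration (not Collate's actual algorithm) shows why Collate 1 left the decision to the scholar.

```python
# Two equally plausible alignments of 'a cat' (witness 1) against 'cat'
# (witness 2). GAP marks a reading absent from one witness. This is a
# toy illustration, not Collate's actual data structures.
GAP = None

# Reading 1: the phrase 'a cat' replaces the single word 'cat'.
replacement = [
    (("a", "cat"), ("cat",)),
]

# Reading 2: 'cat' matches in both witnesses; 'a' is an addition.
match_plus_addition = [
    (("a",), (GAP,)),
    (("cat",), ("cat",)),
]

def cost(alignment):
    """Toy cost: one point per aligned pair whose readings differ."""
    return sum(1 for w1, w2 in alignment if w1 != w2)

print(cost(replacement), cost(match_plus_addition))  # -> 1 1
```

Since both alignments score the same under any simple difference count, the choice between them is editorial, not computational, which is the case for interactive adjustment made in the surrounding text.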
<span lang="EN-GB">This was September 1989 and
if you wanted to make a program for personal computers with interactive
point-and-click facilities, there was only one choice: the Macintosh.
Microsoft had attempted two versions of Windows up to then, but neither
appeared sufficiently stable for a neophyte programmer. By comparison,
programmer tools for the Mac were well advanced. Also, I knew Macintosh
computers very well, as I had used a succession of Macs for writing my thesis.
Apple Computer donated a Macintosh SE (I think) to the project, we purchased a
C programming compiler — Lightspeed C, which became Think C quite soon — and we
were started. In the early days we did not even have a hard disc. The SE had
two floppy disk drives, which made it a truly luxurious machine in those days:
you could have the program and some data on one floppy disc drive, and the
operating system and other data on another floppy disc drive. Much of the time
was spent juggling data and programs between discs, ejecting and inserting disc
after disc, sometimes hundreds of times a day (so much so, that someone even
adapted the pop-up mechanism from a toaster to automate insertion and removal
of discs).<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
<span lang="EN-GB">The choice of C meant a
complete ground-up rewrite of the program, within a windows/icons/menus/pointer
(WIMP!) environment. So Collate 1 began, with the first versions released in
1991. This retained the fundamental features of Collate 0 referred to above
(collation by blocks based on full transcripts, basic markup) with newer tools:
a ‘live’ collation mode combined with point-and-click adjustment of
regularization and setting variants; expanded and more flexible markup,
including notation of layout features such as pages, columns, lines and text
ornamentation; output formatted for TeX processing using the Edmac macros for
complex critical edition layout. In a series of talks in 1990 and 1991 — at the
New Chaucer Society conference in Canterbury; the ALLC conference in Phoenix,
Arizona; in Austin, Texas; at Georgetown University in Washington; at the Society
for Textual Scholarship in New York; especially, at the CHUG meeting in
Providence — I described the unfolding Collate, and recruited its first
enthusiastic and hopeful users. Some of these users are still with Collate, many
years on: Don Reiman and Neil Fraistat incorporated it into the work they did
on their Johns Hopkins Shelley edition; hardly a week since has passed without
a message (admonitory, exhortatory, or plain friendly) from Michael Stolz; and
after fifteen years Prue Shaw was finally able in 2006 to publish her edition
of Dante’s <em>Monarchia</em>, built with Collate.<o:p></o:p></span><br />
<span lang="EN-GB">Collate 1 established the
user interface still basic to the current Collate 2, which has retained all the
major features outlined above. Collate 2 also is built on the same C code as Collate
1. There is no ‘clean break’ between Collate 1 and 2 as there is between
Collate 0 (written in Spitbol) and Collate 1 (written in C) — and as there will
be between the current Collate 2 and its successor (which I now think of as
CollateXML, and which I now contemplate will be written in Java, ‘now’ being
January 2007). However, various developments in the early 1990s led to such a
drastic reshaping and enlargement of Collate 1 that I came to think of this as
‘Collate 2′. </span><span lang="IT">These developments, in no special order,
were:</span><br />
<ol start="1" type="1">
<li class="MsoNormal"><span lang="EN-GB">The onset of the Text Encoding Initiative.
Oxford, through Susan Hockey and Lou Burnard (in those days, the Tony
Blair and Gordon Brown of UK humanities computing), was the European
leader of the TEI. I found myself drawn into the TEI orbit, even becoming
the absurdly underqualified chair of the Scholarly Apparatus workgroup
(which included Robin Cover, Ian Lancashire, Bob Kraft and Peter
Shillingsburg, so you can see how junior I should have felt). I also
attended meetings of the primary source transcription workgroup, though
for some reason this has never been recognized in the TEI documentation,
and I ended up writing almost the whole of the chapters on textual
apparatus and transcription encoding in the TEI (though again, this has
never been clearly acknowledged). Through the TEI I learnt about SGML, and
became completely convinced that structural markup (though not
hierarchical markup) is key to useful scholarly work in the digital age.<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">The appearance of the web. Oxford was one
of the very first sites to mount a web server (as early as late 1992, if I
recall rightly) and I attended the first web conference, held at CERN in
April 1994, when the web was still small enough for a meeting of server
administrators to be held under a tree on the lawn outside the CERN
lecture halls.<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">The development of the Canterbury Tales
project. In our proposal to the Leverhulme Trust we stated that we would
use the manuscripts of the Wife of Bath’s Prologue as test material. Susan
Hockey and I did not think very deeply about this choice: we were just
looking for something that was not Old Norse (our other choice of test
material was the Old Norse <em>Solarljod</em> — and this year, finally, my
and Carolyne Larrington’s edition of this should appear in the massive new
edition of Old Norse skaldic poetry), which was in about the right number
of manuscripts, seemed to present interesting problems, and would be fun
to work with.<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">The demands of other Collate users. The
key group here was the Institute for New Testament Research, Munster. I
first met this group in 1996: in 1997 I started working with them
intensively on the Nestle-Aland Greek New Testament, and through them met
David Parker and the scholars he was working with in Birmingham.<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">Collaboration with researchers in evolutionary
biology. I had already discovered the power of phylogenetic methods
through Robert O’Hara: particularly, his entry to the ‘textual criticism
challenge’ in 1991, showing how these methods worked with the Old Norse <em>Svipdagsmal</em>
tradition. Robert and I developed this into several articles but were
unable to carry it much further. However, in 1996 I met, through Linne
Mooney, Chris Howe of the Cambridge University Department of Molecular
Biology. As a professional evolutionary biologist, he was able to bring
many more resources to this enquiry — particularly, he brought in a series
of remarkable individual researchers to the work, each contributing new
perspectives.<o:p></o:p></span></li>
</ol>
<span lang="EN-GB">In different ways, these
forced me to refine what Collate did, and to develop new capacities for it, to
such an extent that Collate became a new program. The key change was that I
came to think that the aim of Collate was not to help scholars prepare print
editions, but to help them make electronic editions. This had many
consequences. Particularly, it meant that Collate had to prepare materials for
inclusion in an electronic edition. This meant first of all SGML — and later,
XML and HTML. This meant also extended parsing facilities. I did not go so far
as adapting Collate to collate files fully encoded with SGML. Collate now had a
body of users with many files encoded in the Collate format and content to go
on using that format and I would have had considerable difficulty persuading
them to move over to full SGML. But I did tighten the Collate encoding model to
make it closer to SGML, and then added comprehensive facilities to translate
Collate encoded files to SGML (and also XML, HTML and other systems). I also
folded two full SGML parsers into the program: both Pierre Richard’s YASPMAC
and James Clark’s SP. These were used particularly for translating SGML
encoded apparatus files into other forms, particularly into NEXUS files for
analysis by evolutionary biology programs.<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
<span lang="EN-GB">While these extended
Collate’s grasp, the requirements of its most demanding users forced it in
other directions. One of these demanding users was the Canterbury Tales
Project. As we moved onto larger sections of text, and particularly sections
where no two manuscripts had the same lines in the same order, I discovered we needed
a much more powerful system for dealing with witnesses which had the text
blocks in many different orders. ‘Block maps collation’ was, and is, Collate
2’s solution to this. But perhaps the biggest shift of all was one that many
users may not see at all. This is the adoption of ‘parallel segmentation
collation’, directly as a result of the experience of working with Munster
scholars and with evolutionary biologists. I explain at some length exactly how
these two groups led us to abandon the ‘base text collation’ we used before
1998 in favour of ‘parallel segmentation collation’ in the article ‘Collation
Rationale’ included in the Miller’s Tale CD-ROM.<o:p></o:p></span><br />
<span lang="EN-GB">Adopting this model forced
changes on many areas of the program: particularly, on the ‘Set Variants’ module,
and also on the kinds of analysis and variant display we could now achieve.
Perhaps most of all, it puts us in reach of a yet more sophisticated mode of
collation: what I describe as ‘multiple progressive alignment’ in the
‘Collation Rationale’ article. Briefly: once we have aligned the variation
across the witnesses into parallel segments, one could then go a step further
and analyse the witness groupings within the segments. This is standard
practice in analysis of variant DNA sequences in evolutionary biology but I
have not implemented this in Collate 2: here, indeed, is a task for the next
Collate.<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
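‘Parallel segmentation collation’ as described above can be sketched as follows: variation across all witnesses is aligned into segments, and within each segment the witnesses fall into groups by shared reading; analysing those groupings further is the ‘multiple progressive alignment’ left to the next Collate. The readings below are invented examples, not Collate 2's internal representation.

```python
# Parallel segmentation in miniature: each segment records the reading
# of every witness at one point of variation. The readings are invented.
segments = [
    {"A": "the",   "B": "the",   "C": "the"},    # all witnesses agree
    {"A": "yonge", "B": "yonge", "C": "olde"},   # A+B against C
    {"A": "sonne", "B": "sunne", "C": "sonne"},  # A+C against B
]

def groups(segment):
    """Group the witnesses of one segment by the reading they share."""
    by_reading = {}
    for witness, reading in segment.items():
        by_reading.setdefault(reading, []).append(witness)
    return by_reading

for seg in segments:
    print(groups(seg))
```

Each segment's grouping is exactly the kind of unit that phylogenetic analysis of variant DNA sequences works on, which is why the Munster scholars and the evolutionary biologists pushed Collate in this direction.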
<span lang="EN-GB">Collate 2 was formally
released in 1996, and has been continually refined since then. The development
of Collate 1 and 2 now spans over seventeen years, from late 1989 to 2007, and
there is C code within Collate dating back to the very beginning of Collate 1.
This is an eon in the software world. Further, what was a great benefit in the
software world in 1989 — the availability of the Macintosh interface for
interface programming — had by 2007 become a cul-de-sac. The introduction of
Macintosh OS X from 2000 on rendered the future of Macintosh Classic
applications very dubious. I could, in theory, port Collate to OS X and a few
times after 2000 I began to experiment with such a port. I discovered, very
quickly, that this would be a huge task. The Collate code has grown to around
180 files, amounting to around 120,000 lines of code. Perhaps most
discouraging: there are over 80 dialogue windows in Collate, managing the
user’s interaction with the program. Some of these — notably, the
regularization and set variants windows — have extremely complex execution
flows built in them, refined over more than a decade’s experience. One might
abandon some of these: but many of these windows would have to be hand-made
anew in the OS X environment. Further, OS X changed many aspects of the graphic
environment inhabited by Classic, and one would have to go through the code
line-by-line at some points changing the old for the new. Many of these changes
would involve complex reprogramming. And at the end: one would have a program
which still ran on only one operating system.<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
<span lang="EN-GB">Other things, too, had
changed. The mantra of ‘write once, run everywhere’ had taken root, and a new
generation of tools (notably, the Java programming environment) had arisen to
support this aim. It is now a real possibility to write a complex graphic user
interface program which runs identically, and as if native, on multiple
platforms. Further, the XML world has matured, with a speed that would seem
unimaginable to the very slow pace of development of applications for its
predecessor, SGML. And most decisively, perhaps: a model of open-source
collaborative programming has developed. All the time that I wrote Collate 1
and 2, the authoring model for software was modelled on that for books: a
single person wrote the software, and then it was sold. But since the mid 90s,
the open source movement, built on voluntary collaboration, has gathered pace.
This is particularly so in the university and research worlds, where the news
that you might even be considering writing software to sell is met with
disbelief — so that funding bodies routinely now insist that software code be
open source. Within the XML world too, another model of programming has also
developed: away from the all-inclusive this-application-will-do-it-all to a
federated world of individual co-operating programs. This is particularly true
in the web world: a simple user request may invoke one program to work out how
to respond, which then summons data from a relational database, combines this
with other data from an XML database (using XQuery and other X applications),
blends into XML, which an XML formatter then transforms to HTML, which the
server then passes back to the requester.<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
<span lang="EN-GB">This leaves us, then, with
a set of directions we can follow for CollateXML:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal"><span lang="EN-GB">It will have all the functionality of
Collate 2; particularly, it may support interactive user-adjustable
collation<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">It will be written in a modular form, so
that (for example) applications which want to use collation services but
not to offer interactive adjustment of collation can embed the collation
services in their own environment apart from the user interface<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">It will handle native XML, both with and
without a schema or DTD. However, it should employ its own data interface,
independent of XML, so that future or other markup languages (including,
indeed, the existing Collate markup) could be readily supported by the
program. I am known for predicting the demise of XML: an event which will
occur when computer science departments recognize that the overlapping
hierarchy problem is not a ‘residual’ difficulty, but a fundamental
feature of text.<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">It will be written co-operatively, in an
open source environment<o:p></o:p></span></li>
<li class="MsoNormal"><span lang="EN-GB">The best bet for its development appears
to be Java. The range of XML tools already offered by Java gives us an
excellent platform — as, too, the remarkable string-processing library
Java offers. Combine this with its high modularity, its excellent support
for graphic interfaces, and its popularity with XML developers (not least,
the eXist world) and we have an extremely compelling case.<o:p></o:p></span></li>
</ol>
<span lang="EN-GB">So far, the history of
Collate.<o:p></o:p></span><br />
<span lang="EN-GB"><br /></span>
<span lang="EN-GB">All this means: the next
version of Collate must be open source.<o:p></o:p></span><br />
</div>
PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com4tag:blogger.com,1999:blog-5774054219585481589.post-65383589080210776762014-09-29T06:39:00.000-07:002014-09-29T06:39:26.362-07:00Collate 2, and the design for its successor: CollateXML (now, CollateX)<h2>
Historical note: the contexts of this document</h2>
<div>
The article which follows was posted on the Scholarly Digital Editions blog in a series of entries from February to June 2007. There were two contexts for these original blog posts:</div>
<div>
<ol>
<li>The age, and impending death, of <i>Collate 2</i>, the computer-assisted scholarly collation program I had started writing in the late 1980s. This had achieved some success, at least by the simple measure that it was one of the rare "humanities computing" (as we used to call it) computer programs used by people other than the person who wrote it. So we used it in the Canterbury Tales project to make some six digital editions; Prue Shaw used it to make her editions of Dante's <i>Monarchia</i> and <i>Commedia</i> (see <a href="http://www.sd-editions.com/">www.sd-editions.com</a>); the Greek New Testament editing projects at Munster and Birmingham made it the centre of their moves into the digital age; Michael Stolz used it for his edition of <i>Parzival</i> (<a href="http://www.parzival.unibe.ch/home.html">www.parzival.unibe.ch/home.html</a>). <i>Collate 2</i> was written for classic Macintosh computers, and the advent of OS X from 2002 on first cast doubt on the future of the "classic" operating system, and then became a death sentence when Apple announced in 2005 that OS X was moving to Intel processors, and that when it did make this move, the classic system would die. Of course, I could have rewritten <i>Collate</i> for OS X. However, it was clear that this was no simple matter. The heart of <i>Collate</i> was (and is) a series of interactive routines, allowing the scholar to control the collation through multiple dialogue boxes: over a hundred of them, all told. OS X introduced a quite different (and vastly superior) model for handling interactive dialogues; every one of these venerable "event loops" would have to be rewritten. I could have done this, but by this time, I was aware that there were fundamental things which <i>Collate</i> could not do. Sometimes, you can renovate. Sometimes, you have to rebuild from the ground up.</li>
<li>As I was mulling this over from 2005 on, I began to talk to two people in particular who were interested in computer scholarly collation systems: Fotis Jannidis, the director of TextGrid, and Joris van Zundert, engaged in software systems development at the Huygens Institute. Both Fotis and Joris had a vital scholarly interest in collation, and both might have access to resources (which might need to be considerable) to write a successor to Collate. In late January 2007, Joris convened a meeting in The Hague to discuss editing software systems, including collation; in early 2008, Fotis came to Birmingham to discuss how we might proceed. </li>
</ol>
This blog post was directly stimulated, then, by the meeting with Joris and others in 2007. It is an attempt to lay out what I thought might be fundamental to a useful successor to Collate. At the time, I said, half-jokingly, that it took me about five years on my own to write the first version of Collate; it would take ten people ten years to write its successor. Following these meetings, by various indirections, the InterEdition project (bringing together Huygens and TextGrid people) got started. A group loosely based within InterEdition took on what quickly became known as CollateX, with Ronald Dekker particularly taking on writing the core software routines. A look at the CollateX site, some seven years on, suggests that my idle prediction was not too far astray.</div>
<div>
<br /></div>
<div>
Now, the original blog posts, as posted between 12.58 pm GST, 5 February 2007 and 10.02 pm, 28 June 2007. I follow this with an email I sent to Fotis, Joris and others announcing these posts. Among other matters: the post of June 21 announces the new name: CollateX. Thus it has been since.</div>
<div>
<br /></div>
<div>
**************************************</div>
<div>
<br /></div>
<div>
<div class="MsoNormal" style="text-align: justify;">
<i><span lang="EN-GB">February 5, 2007<o:p></o:p></span></i></div>
<div class="MsoNormal" style="text-align: justify;">
<b><span lang="EN-GB">The design of
CollateXML<o:p></o:p></span></b></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB">Filed under: Designing Collate — Peter @ 12:58
pm<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB">In this document I set out, as clearly as I
can, the various datastructures and operations which I think CollateXML will
require.<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB">The fundamental design of CollateXML is this:<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB"> 1. The input is various streams of
text, divided into marked collation blocks<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB">
2. These various streams of text are located<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB">
3. Within the streams of text, each corresponding block for collation
must be located<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB">
4. The collation program creates two sets of collation information:<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB"> 1. concerning the
different orderings of the blocks within the streams of text<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB"> 2. concerning the
differences in the texts contained in the blocks themselves<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB">The collation information is then formatted for
output</span></div>
<div class="MsoNormal" style="text-align: justify;">
A few observations:</div>
<div class="MsoNormal" style="text-align: justify;">
</div>
<ol>
<li>In Collate0-2,
the input was always computer files, held on the computer itself. In
CollateXML, the input could be ‘any text, anywhere’: from a database, local or remote;
from a URL anywhere.</li>
<li>The crucial
marking of collation blocks should be done through something like the
‘universal text identifier’ scheme I outlined at The Hague on 25 January 2007.</li>
<li>Collate0-2 did
only ‘word by word’ collation. This presumes that the texts are ‘word by word’
collatable: without very large areas of added, deleted, or transposed text. But
many texts have a different kind of relation: large portions of one text might
be embedded in another text, but other areas of the texts are very different
(the situation common in plagiarism, or ‘intertextuality’, for example).
Collate0-2 did not handle this situation; CollateXML should be able to do so.</li>
<li>CollateXML
should have its own internal data models for passing information both to and
from the collation process. These models should be exposed through an API to
programmers, who can then provide import and export for whatever formats they
choose.</li>
</ol>
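Observation 4 can be made concrete with a minimal sketch. Nothing here is CollateXML's actual API; the names `Witness` and `CollationBlock` are hypothetical, chosen only to show the shape of an internal model that hides where the text came from:

```python
from dataclasses import dataclass, field

@dataclass
class Witness:
    """A hypothetical witness record: the collation core sees only a
    sigil and a token stream, regardless of whether the text came from
    a local file, a database, or a URL."""
    sigil: str
    tokens: list

@dataclass
class CollationBlock:
    """One marked collation block, gathering the corresponding text
    from every witness."""
    block_id: str
    witnesses: list = field(default_factory=list)
```

Importers and exporters would then be written against these two classes, rather than against any particular file format.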
<br />
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB"> We can now begin to specify the
building blocks we need.<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB"><br /></span></div>
<div class="MsoNormal">
<i style="mso-bidi-font-style: normal;"><span lang="IT">February
5, 2007<o:p></o:p></span></i></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;"><span lang="IT">How
CollateXML should work<o:p></o:p></span></b></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Filed
under: How CollateXML should work — Peter @ 1:18 pm <o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The separation of
collation into stages<o:p></o:p></span></strong><br />
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></strong>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">As Collate0-2 developed, I
learnt that one had to break the collation process into stages. At first,
Collate 0-2 simply collated, and identified the variants as it found them. I
soon learned that the complex requirements of scholarly collation demanded the
adjustment of the collation at various points. To do this, it became clear that
one had to separate out the stages of collation to permit intervention at various
points. However, this separation was grafted onto Collate0-2 in a piecemeal
fashion. I propose that from the beginning, CollateXML separate out what appear
to me now as the following fundamental stages of collation:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">text alignment, one witness at a time
against the base<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">storage of alignment information for all
witnesses against the base<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">adjustment of alignment information for
all witnesses against each other<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">variant identification within the aligned
texts.<o:p></o:p></span></li>
</ol>
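The four stages can be sketched as a pipeline in which each stage is a separate, replaceable function. This is only an illustration of the proposed separation, not CollateXML code: the alignment here is deliberately naive (positional pairing), and all the function names are invented.

```python
def align(base, witness):
    # Stage 1: align one witness against the base.
    # Naive positional pairing stands in for the real alignment routine.
    return list(zip(base, witness))

def adjust(stored):
    # Stage 3: adjust alignment information across all witnesses.
    # A no-op here; this is the hook where an editor could intervene.
    return stored

def identify_variants(stored):
    # Stage 4: only now are variants identified, from the stored alignments.
    return {sigil: [(b, w) for b, w in pairs if b != w]
            for sigil, pairs in stored.items()}

def collate(base, witnesses):
    # Stage 2 is the storage itself: alignments kept per witness.
    stored = {sigil: align(base, tokens) for sigil, tokens in witnesses.items()}
    return identify_variants(adjust(stored))
```

For example, collating the base "the black cat" against witness A "the white cat" yields one variant for A, the pair ('black', 'white'); the point is that each stage can be swapped out or interrupted without touching the others.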
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Text alignment</span></strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><o:p></o:p></span><br />
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></strong>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The fundamental building
block is the alignment routine itself. Here is how I suggest this works for
word by word collation, based on how it worked in Collate 0-2.<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l0 level1 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Each alignment act compares two texts of a
text block at once: a specified ‘base’ text and a witness text, starting
at the first word of the text block in each<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The alignment examines the first word: if
they are identical it returns this information; if there is a variant it
returns that information along with: the number of words matched in base
text and witness text. </span><span lang="IT">The possibilities are: </span></li>
<ol start="1" type="a">
<li class="MsoNormal" style="mso-list: l0 level2 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 72.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">same word in each. 1 word matched in
each; next word to match in base will be word 2; next word to match in
witness will be word 2<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level2 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 72.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">one word replacing one word in each. next
word to match in base will be word 2; next word to match in witness will
be word 2<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level2 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 72.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">word omitted in witness. next word to
match in base will be 2; next word to match in witness will be 1<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level2 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 72.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">word added in witness. next word to match
in base will be 1; next word to match in witness will be 2<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level2 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 72.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">phrase omission or addition: as for c and
d, but the next word to be matched in base or witness will be adjusted
accordingly<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level2 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 72.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">phrase replacement: if two words in the
base are replaced by three words in the witness: then the next word in
the base to be collated for this witness will be word 3; the next word to
be collated in the witness will be word 4.<o:p></o:p></span></li>
</ol>
</ol>
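Each alignment act, as listed above, boils down to a record of how many words were consumed on each side, so that the caller knows where to resume. A sketch; the `Alignment` class is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    """What one alignment act reports back: the kind of variation found,
    plus how many words it consumed in base and witness."""
    kind: str           # e.g. 'match', 'replacement', 'omission', 'addition', 'phrase'
    base_used: int      # words consumed in the base
    witness_used: int   # words consumed in the witness

# Case f above: two base words replaced by three witness words, so the
# next base word to collate is word 3 and the next witness word is word 4.
phrase = Alignment(kind='phrase', base_used=2, witness_used=3)
```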
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This is, fundamentally,
rather simple. You can look over the C code for Collate 2 to see how we did
this. Essentially, at each alignment, Collate 2 carried out a series of tests,
till it got a match:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">are the next words identical?<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">are the next words a variant, and so align
against each other? Collate used a ‘fuzzy match’ algorithm for this.
Essentially, if the two words had more than 50% of their letters in common
(weighted according to the position of the letters) then Collate said,
these words align. Thus, Collate would see ‘cat’ and ‘mat’ as variants on
each other<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">it could be that while this word does not
match, the next word does. So Collate will look at ‘black cat’ and ‘white
cat’ and declare that ‘black’ and ‘white’ align, because the next word is
a match. Indeed, Collate would look at ‘black cat’ and ‘white mat’, see
that mat/cat align because they satisfy the fuzzy match test, and so
declare black/white align<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">If there is still no match: Collate tests
</gr-replace>
the second word in the master against the first word in the witness. If
they match: Collate concludes that the first word in the master is
omitted.<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Still no match: Collate tests the second
word in the witness against the first word in the base. If they match:
Collate concludes that the first word in the witness has been added. Now,
here is an important point: after establishing that the first word in the
witness has been added, Collate goes around again to collate the SECOND
word in the witness against the first word of the base, and reports a
SECOND variant at this point. For example: if the base has ‘mat’ and the
witness has ‘black cat’, Collate could report that ‘black’ has been added,
and that ‘cat’ is a variant on ‘mat’. </span><span lang="IT">See further
below on additions and omissions.</span></li>
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Still no match: Collate guesses that maybe
the problem is word division. So it concatenates words in the base and the
witness, comparing as it does, to see if it can find a match<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Still no match: Collate starts searching
for phrase variants: addition/omission/replacement. In essence, it looks
further along the text, seeking to find sequences that match, with
everything up to the match a replacement/addition/omission. This is
probably the least sophisticated part of the current collate. Collate also
has a limit of 50 words for its look up: this might be lifted.<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">After Collate has found the match on this
first word: it then looks to check if the NEXT match between this witness
and the base is an addition in this witness. The reason it needs to do
this is explained in the section on additions and omissions below.<o:p></o:p></span></li>
</ol>
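The fuzzy-match test in step 2 can be sketched as follows. The exact positional weighting Collate 2 used is not specified here, so this version simply scores letters that agree at the same offset; it is an approximation, not the original algorithm.

```python
def fuzzy_match(a, b, threshold=0.5):
    """Do two words share more than half their letters?  A rough
    stand-in for Collate's position-weighted fuzzy test."""
    if not a or not b:
        return False
    same = sum(1 for x, y in zip(a.lower(), b.lower()) if x == y)
    return same / max(len(a), len(b)) > threshold
```

On this test ‘cat’ and ‘mat’ align (two letters of three agree), while ‘black’ and ‘white’ do not, which is why, in step 3, ‘black’ and ‘white’ can only be aligned by looking ahead to the following match.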
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">After Collate has done this
for this word in the base against this witness: it goes on to do the same for
the next witness against this same base. As it identifies each alignment, it stores
the alignment information for each witness. When it has worked its way through
all the witnesses for this word, and has stored the alignment for each: it
proceeds through the next stages of adjusting the alignment and finally
identifying the actual variants.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">After completing, as
described, the alignment for the first word of the base against all the
witnesses: Collate now goes on to align the second word of the base. Notice
particularly what happens when Collate discovers that it has already matched past
the next word in the witness: when, say, the first six words of the base have
been replaced by the first eight words of the witness. In that case, Collate
will skip over that witness until it is collating word 7 of the base: it will
then restart the collation by collating word 9 of the witness against that
word.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Alignment is NOT
variant identification</span></strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><o:p></o:p></span><br />
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></strong>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">I have spoken so far only
of text alignment, not variant identification. The difference is important.
Here is an example:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b>Base</b> the black cat<br /><b>
A</b> The black cat<br /><b>
B</b> THE BLACK CAT<br /><b>
C</b> The black cat<br /><b>
D</b> The, black cat<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">For the purposes of text
alignment: the collation algorithm should ignore case differences, punctuation
tokens, and XML encoding around or within the words. Thus: it should identify
the first word of each one of the four witnesses as aligned against the first
word of the base. But note: it might be desirable to identify each or any one
of the four first words as having a variant at this point. This variant
identification is to be done at a later point. For now, all we have to do is
state that the first word in each of the four witnesses aligns against the
first word of the base.<o:p></o:p></span><br />
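A sketch of the normalization this implies, used for alignment purposes only (the raw forms are kept for later variant identification). The regular expressions here are illustrative, not CollateXML's actual tokenizer:

```python
import re

def alignment_form(token):
    """Reduce a token to the form used for alignment: strip XML
    encoding, drop punctuation, fold case, so that 'The,', 'THE' and
    '<hi>The</hi>' all align against 'the'."""
    token = re.sub(r'<[^>]+>', '', token)   # remove XML tags around/within the word
    token = re.sub(r'\W', '', token)        # remove punctuation tokens
    return token.lower()
```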
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Additions and
omissions in Collate<o:p></o:p></span></strong><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This is a particularly
difficult area. I discovered in the course of writing Collate0-2 that different
people want all kinds of different things. Some people do not want to see
additions and omissions at all, but only replacements of shorter or longer
phrases by longer or shorter ones. When it is an addition and the scholar
wants this seen as a replacement of a shorter phrase by a longer one, some
scholars want to see the addition attached at the beginning, some at the end.
Take this text:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b>Base</b>: a cat<br /><b>
witness</b>: a black cat<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Here are the possibilities:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">a ] a; black added<br />
cat ] cat (writing the addition with the PRECEDING word)<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">OR<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">a ] a<br />
cat] cat; black added (writing the addition with the FOLLOWING word)<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">OR<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">a ] a black<br />
cat ] cat (as phrase, addition with preceding word)<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">OR<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">a ] a<br />
cat ] black cat (as phrase, addition with following)<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">OR<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">a ] a<br />
… ] black<br />
cat ] cat (this is actually the system used in Munster!)<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Collate0-2 supports all
these possibilities, but does it in a rather inelegant way. Essentially, it
tries to adjust for these possibilities WHILE it collates. This is complex and
inflexible. Instead, I propose that CollateXML separate completely the
discovery of alignment, its storage and its expression. Collate0-2 almost does
this, but does not do it thoroughly. Broadly, I began with outputting the
variants as the program discovered them. Increasingly I found that one needed
to adjust the variation in various ways, and so moved towards separation of the
stages of alignment discovery, storage and variant identification. A major
benefit of this separation is that it permits adjustment of the variation at the
storage point: see next. However, Collate0-2 never quite managed a complete
movement to this separation. I propose that CollateXML have this separation at
its heart.<o:p></o:p></span><br />
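The payoff of that separation is that one stored alignment can be expressed in any of the styles listed earlier without re-collating. A sketch, using the ‘a black cat’ example; the storage format (a simple tuple) and the function name are invented for illustration:

```python
def express_addition(stored, style):
    """Render one stored addition ('black' added between base 'a' and
    'cat') in either of two apparatus styles.  Discovery and storage
    are untouched; only the expression changes."""
    before, added, after = stored
    if style == "preceding":      # addition attached to the preceding word
        return ["%s ] %s; %s added" % (before, before, added),
                "%s ] %s" % (after, after)]
    if style == "following":      # addition attached to the following word
        return ["%s ] %s" % (before, before),
                "%s ] %s; %s added" % (after, after, added)]
    raise ValueError("unknown style: %s" % style)
```

The same stored tuple could equally drive the phrase styles, or the Munster style with its separate slot line; each is just another expression function.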
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The storage of
alignment information</span></strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br />
So far, I have been describing how Collate discovers alignment. Essentially,
Collate discovers, for each word in the base text, for a particular witness,
exactly what alignment is present in a given witness at that point. </span><span lang="IT">The possibilities are:</span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">base and witness align at this word,
either because there is no variation (base and witness agree on this word)
or because base and witness vary at this word (either as: omission of this
word, or variation on the word)<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">base and witness align and there is an
addition before and/or after this word (note: this includes the
possibility of an addition before the word, omission of the word, and
addition after the word)<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">the word is the beginning of a phrase
alignment, with or without an addition before the phrase alignment (note:
this includes the possibility of phrase omission)<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">the word is the ending of a phrase
alignment, with or without an addition after the phrase alignment
(including, the possibility of phrase omission)<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">the word falls within a phrase alignment
(for example: ‘the black cat’ replaced by ‘a white mouse’. When Collate
comes to collate ‘black’ in the base against this witness, it will find
that it falls within a phrase alignment and move on to the next word.)<o:p></o:p></span></li>
</ol>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">It can be seen that
alignment is much complicated by the need to deal with ‘additions’. Again,
Collate0-2 never quite dealt with this as well as it needs to, and again I
propose to remedy this in CollateXML. </span><span lang="IT">I suggest that they
be dealt with as follows:</span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l0 level1 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The base text is seen as a series of
slots, corresponding to the words AND the space before the first word, between
each word, and after the last word<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The variants in each witness are aligned
against these slots. Thus: an addition before the first word is aligned
against the slot before the first word; a variant at the first word is
aligned against the first word; an addition between the first and second
words is aligned against the slot between these words, and so on.<o:p></o:p></span></li>
</ol>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">One may illustrate this
with the base ‘black cat’ collated against the witness ‘the black and white
cat’. Numbering each ‘slot’ in the base from one, we have:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<table border="0" cellpadding="0" class="MsoNormalTable" style="mso-cellspacing: 1.5pt;">
<tbody>
<tr style="mso-yfti-firstrow: yes; mso-yfti-irow: 0;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>numbers</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">1</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 2</span></div>
</td>
<td colspan="2" style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">3</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">4</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 1;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>base</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">black</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">cat</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 2; mso-yfti-lastrow: yes;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>witness</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">black</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">and</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">white</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">cat</span></div>
</td>
</tr>
</tbody></table>
<br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Thus: the additions ‘the’
and ‘and white’ align against slots 1 and 3. In this system, even numbers are
used for words and odd numbers for the spaces between the words. (I am indebted to
the Institute for New Testament Research, Münster, for this numbering system,
and for this conception of the base as a series of slots for both words and the
spaces between them.)<o:p></o:p></span><br />
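This slot numbering is easy to sketch in code. The following is a minimal illustration, assuming nothing about Collate's internal data structures; the function name and dictionary shape are my own:

```python
def slot_numbers(base_words):
    """Assign Münster-style slot numbers: even slots (2, 4, 6, ...) hold the
    base words, and the odd slots (1, 3, 5, ...) are the spaces between and
    around them, left free for additions to align against."""
    return {2 * (i + 1): w for i, w in enumerate(base_words)}
```

So for the base ‘the cat sat’, the words occupy slots 2, 4 and 6, and an addition before ‘cat’ would be stored against slot 3.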
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The adjustment of
variant information: a relatively simple case</span></strong><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b><br /></b>
The principal benefit of the separation of the stages of alignment discovery,
storage and output is that it permits adjustment of the variant alignments at
the storage stage and before the output.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Consider the case of the
following (rather fictitious) instance:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b>Base</b>: The cat sat on the
mat<br /><b>
Witness</b>: The black sat on the mat<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Left to itself, Collate0-2
will tell us that the variant is:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">cat ] black<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">But what has in fact
happened here is more accurately represented as:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The ] The<br />
.. ] black<br />
cat ] omitted<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">That is: first ‘black’ is added,
then somehow ‘cat’ is omitted.</span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br />
We may have many witnesses which read ‘The black cat’ where the base reads ‘The
cat’. In this case, at the storage stage, we should expect Collate to look over
the variants discovered in the other witnesses, find that in many others we
have ‘black’ added, and it should then adjust the stored variant information so
that instead of reading:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<table border="0" cellpadding="0" class="MsoNormalTable" style="mso-cellspacing: 1.5pt;">
<tbody>
<tr style="mso-yfti-firstrow: yes; mso-yfti-irow: 0;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>numbers</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">1</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 2</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 3</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 4</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">5</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 6</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 1;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>base</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">cat</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">sat</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 2;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>witness1</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">black</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">sat</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 3;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>witness2</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">black</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">cat</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">sat</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 4; mso-yfti-lastrow: yes;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>witness3</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">black</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">cat</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">sat</span></div>
</td>
</tr>
</tbody></table>
<br /><span lang="IT">the stored representation reads:</span><br />
<span lang="IT"><br /></span>
<table border="0" cellpadding="0" class="MsoNormalTable" style="mso-cellspacing: 1.5pt;">
<tbody>
<tr style="mso-yfti-firstrow: yes; mso-yfti-irow: 0;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>numbers</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">1</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 2</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 3</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 4</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">5</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"> 6</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 1;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>base</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">cat</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">sat</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 2;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>witness1</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">black</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">sat</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 3;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>witness2</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">black</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">cat</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">sat</span></div>
</td>
</tr>
<tr style="mso-yfti-irow: 4; mso-yfti-lastrow: yes;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT"><b>witness3</b></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">the</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">black</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">cat</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">sat</span></div>
</td>
</tr>
</tbody></table>
<br /><span lang="EN-GB" style="mso-ansi-language: EN-GB;">That is: with ‘black’
matching against the space between ‘the’ and ‘cat’ (as it does in the other
witnesses) rather than against ‘cat’.<o:p></o:p></span><br />
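The adjustment itself can be sketched in a few lines. This is a hypothetical illustration only: the function name, the per-witness dictionary of slot-to-word alignments, and the slot numbers are my own assumptions, not Collate's actual storage:

```python
def adjust_alignment(row, word, gap_slot, base_slot):
    """Storage-stage adjustment (sketch): if `word` is currently stored as a
    replacement of the base word at `base_slot`, move it to `gap_slot`, where
    the other witnesses attest it, leaving the base word simply omitted."""
    if row.get(base_slot) == word:
        del row[base_slot]        # 'cat' is now recorded as omitted
        row[gap_slot] = word      # 'black' aligns with the space before 'cat'
    return row
```

Applied to the witness reading ‘The black sat’, with ‘black’ initially stored against the ‘cat’ slot, this moves ‘black’ into the gap slot and leaves ‘cat’ omitted, matching the table above.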
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Collate0-2 did NOT do this
variant adjustment. CollateXML should do it. In the next section, I consider
some possibilities.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The adjustment of
alignment information: towards multiple progressive alignment</span></strong><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b><br /></b>
This matter of automated adjustment of variant information at the storage stage
— that is, after the collation of a particular word has finished — is one area
where the algorithms of Collate0-2 could be dramatically improved.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Consider, first, this case:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b>base</b> the
white and black cat<br /><b>
witness1</b> the black and white cat<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Collate0-2 will record this
as a single piece of variant information: that the whole phrase ‘white and
black’ in the base has been replaced by the whole phrase ‘black and white’. It
has been pointed out to me, quite separately, by two very different groups of
scholars, that this is inadequate (the two groups are: the Münster institute,
and the department of Molecular Biology in Cambridge):<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">This does not record that the words of the
variant text ‘black and white’ are actually the same as those of the base<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">As a result: suppose that a second witness
has ‘green or blue’ for this phrase. To the program (and hence, to any
system based on it) the variants ‘black and white’ and ‘green or blue’ are
each exactly as different from the base text ‘white and black’ as they are
from each other. But this loses
a key piece of information: that the variant ‘black and white’ is actually
much closer to the base than is the variant ‘green or blue’.<o:p></o:p></span></li>
</ol>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">CollateXML needs to find a
way of adjusting the variant store to show that in fact the variant ‘black and
white’ represents not one, but four pieces of information:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l0 level1 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">firstly, that there is a phrase variant
(the existing Collate0-2 algorithms do this)<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">secondly, that actually each word in the
phrase variant does agree with the base: a further three pieces of
information (Collate0-2 goes a little way towards this, but not far
enough)<o:p></o:p></span></li>
</ol>
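The decomposition into four pieces of information can be sketched as follows. This is my own illustration, not CollateXML's design: the function name, the record tuples, and the whole-word membership test are all assumptions, and a real implementation would need to handle repeated words:

```python
def decompose_phrase_variant(base_phrase, variant_phrase):
    """Sketch: record the phrase variant itself, plus one record per word of
    the variant noting whether that word also occurs in the base phrase."""
    base_words = base_phrase.split()
    records = [("phrase", base_phrase, variant_phrase)]
    for w in variant_phrase.split():
        records.append(("word-agrees" if w in base_words else "word-variant", w))
    return records
```

For the base ‘white and black’ and the variant ‘black and white’, this yields one phrase record plus three word-agreement records: the four pieces of information described above. For ‘green or blue’ it would yield one phrase record and three word-variant records, preserving the fact that the first variant is much closer to the base.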
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Consider, further, this
case:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b>base</b> the
white and black cat<br /><b>
witness1</b> the black and white cat<br /><b>
witness2</b> the black and green cat<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Here, we should show that
witness2 both has a phrase variant AND is a witness for the words ‘black’ and
‘and’, and, furthermore, has a variant ‘green’ on the word ‘white’ in both the
base and witness1. One wants an output as follows:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">white and
black ] black and white witness1; black and green witness2<br />
white ] witness1; green witness2<br />
and ] witness1 witness2<br />
black ] witness1 witness2<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">If we can figure out a way
to store this information then we are well on our way to collation nirvana:
multiple progressive alignment. But before we get to that place: we have to
understand parallel segmentation.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Variant information
storage and parallel segmentation</span></strong><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b><br /></b>
Perhaps the single most important development in Collate2 was the support for
parallel segmentation. I write about this in the ‘Collation rationale’ on the
Miller’s Tale CD-ROM. The example I use there is<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This
Carpenter hadde wedded newe a wyf<br />
This Carpenter hadde wedded a newe wyf<br />
This Carpenter hadde newe wedded a wyf<br />
This Carpenter hadde wedded newly a wyf<br />
This Carpenter hadde E wedded newe a wyf<br />
This Carpenter hadde newli wedded a wyf<br />
This Carpenter hadde wedded a wyf<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">In that article I explained
that in the early versions of Collate, we used to collate this by what I called
‘base text collation’: that is, we would compare each witness (54 in this case)
word by word with this one base, one witness at a time, and output the
variation so:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This
] 54 witnesses<br />
Carpenter ] 54 witnesses<br />
hadde ] 54 witnesses<br />
wedded ] 53 witnesses; E wedded 1 witness<br />
wedded newe ] newe wedded 1 witness, newli wedded 1 witness<br />
newe ] 26 witnesses; newly 1 witness; omitted 1 witness<br />
newe a ] a newe 23 witnesses<br />
a ] 30 witnesses<br />
wyf ] 54 witnesses<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">We see here that for the
first three words and the last word there is no variation, and we just state
accordingly that all witnesses there agree with the base and with each other.
All the variation occurs on the three base text words ‘wedded newe a’. This
variation is actually recorded against five lemmata: in turn ‘wedded’, ‘wedded
newe’, ‘newe’, ‘newe a’ and ‘a’. Observe that the phrases ‘wedded newe’ and
‘newe wedded’ both overlap one other, and also overlap the three words ‘wedded’
‘newe’ ‘a’.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The ‘Collation rationale’
article goes on to explain why we became increasingly dissatisfied with this
method. One factor was that it highlighted the base text: by referring all
variation to this base text, it gave the base text a prominence which we did
not think appropriate. We thought of the base text as just a series of slots
on which we hung the collation: but this mode of expression seemed to give it
an authority beyond this. It is not that we do not believe in ‘edited’ texts:
just that this base text was not conceived as, or intended to be, any such edited
text. But its prominence made it look as if it could be such an edited text.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">A second factor was the
argument put to us by the evolutionary biologists: that where variant lemmata
overlap, as they do in the cases of the five variants on the three words ‘wedded
newe a’, one cannot directly compare the different witnesses. Here, we have one
set of variants on the phrase ‘wedded newe’ and a second on the phrase ‘newe
a’, as well as variants on each individual word. If manuscript A has a variant
on ‘wedded newe’ and B has one on ‘newe a’ there is no way one can compare the
text of A and B directly, and make any statement at all about the relationship
between A and B at those points.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This defect in base text
collation had other implications. We wanted to be able to point at any word in
any manuscript and say: what readings do the other manuscripts have at this
point? But this was exactly what our system could not do. With our system, we
could only say: at this word, the base text has such and such. We could not
always say: at this word, here are all the readings found at this point in all
the other texts. Similarly, we wanted to be able to compare any two (or more)
manuscripts word by word, showing exactly how they differ. Once more, this
system could not do that: we could only show how they severally differed from
the base text, not how they differed from each other.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The only cure for this we
could see was: eliminate overlapping variation. This meant that we should refer
all variants in all witnesses to the same base lemma. This meant that,
practically, the unit of variation has to be fixed by the longest variant
present at any point. In the case of the Miller’s Tale example: with base text
collation we have five sets of lemmata in the three word base sequence ‘wedded
newe a’, and so cannot compare the witnesses on any one of the lemmata with
those for any other. To eliminate all overlapping variation here we should have
one lemma and one lemma only: all three words of the base text here. All
variants on this one lemma are then directly in parallel with each other. The
whole text, across all the witnesses, is broken into parallel segments, with
text of any one witness at any one segment being directly comparable to the
text of any other witness at that segment: hence, the name ‘parallel
segmentation’.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This is the collation given
by the base text collation system, with five different lemmata:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">wedded
] 53 witnesses; E wedded 1 witness<br />
wedded newe ] newe wedded 1 witness, newli wedded 1 witness<br />
newe ] 26 witnesses; newly 1 witness; omitted 1 witness<br />
newe a ] a newe 23 witnesses<br />
a ] 30 witnesses<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Now, this is the collation
given by parallel segmentation, with just one lemma:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">wedded
newe a ] wedded newe a 25 witnesses<br />
wedded a newe 23 witnesses<br />
newe wedded a 1 witness<br />
E wedded newe a 1 witness<br />
wedded newly a 2 witnesses<br />
newli wedded a 1 witness<br />
wedded a 1 witness<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
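Once the text is broken into parallel segments, producing the apparatus for one segment is simply a matter of grouping witnesses by their reading of it. A minimal sketch, with my own assumed data shape (a dictionary from witness name to the text of the segment):

```python
from collections import Counter

def segment_apparatus(readings):
    """Sketch: group witnesses by their reading of one parallel segment,
    returning (reading, count) pairs, most frequent reading first."""
    return Counter(readings.values()).most_common()
```

Because every witness's text at the segment is directly comparable with every other's, this grouping needs no reference to the base at all, which is exactly the point of parallel segmentation.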
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">How did Collate0-2 identify
parallel segments? Collate0-2 used a system of variant information storage
similar to that outlined above: essentially, creating a table which records, in
numeric form, exactly what words in each witness correspond with what words in
the base. It would update this table after collating each word of the base. Then,
it would inspect the table and ask: is there a variant lemma open at this point?
If there were, it would not output any apparatus, but move on to the next
word, and only when it found no variant lemmata open would it output all the
variants on the whole segment of text.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Thus, for the base text
sequence ‘wedded newe a’ it would proceed as follows. It would collate the
first word, ‘wedded’, and discover the following:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">wedded ] 53
witnesses; E wedded 1 witness<br />
wedded newe ] newe wedded 1 witness, newli wedded 1 witness<o:p></o:p></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">That is, the lemma ‘wedded
newe’ is still open after collation of ‘wedded’. So no apparatus is output, and
it goes on to the next word:<o:p></o:p></span><br />
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">newe ] 26
witnesses; newly 1 witness; omitted 1 witness<br />
newe a ] a newe 23 witnesses<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Now, the lemma ‘wedded
newe’ has been closed. But another variant lemma ‘newe a’ is now open. So we
have to carry on to the next word:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">a ] 30
witnesses<o:p></o:p></span></div>
<br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Now, at last: no phrase
variant is open. We can close the segment, and output all the variation found
on the whole phrase ‘wedded newe a’.<o:p></o:p></span><br />
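The open-lemma test that drives this walkthrough amounts to merging overlapping spans of the base. A sketch in Python, where each lemma is a (start, end) pair of base word positions; the representation is my own illustration, not Collate0-2's internal table:

```python
def parallel_segments(lemmata):
    """Sketch of the open-lemma test: merge any lemma spans that overlap, so
    that no variant lemma crosses a segment boundary. Returns the resulting
    parallel segments as (start, end) pairs of base word positions."""
    segments = []
    seg = None
    for start, end in sorted(lemmata):
        if seg is None:
            seg = [start, end]
        elif start <= seg[1]:            # a lemma is still open: keep going
            seg[1] = max(seg[1], end)
        else:                            # nothing open: close the segment
            segments.append(tuple(seg))
            seg = [start, end]
    if seg is not None:
        segments.append(tuple(seg))
    return segments
```

With ‘wedded newe a’ as base words 4 to 6, the five lemmata ‘wedded’, ‘wedded newe’, ‘newe’, ‘newe a’ and ‘a’ chain together into the single segment spanning words 4 to 6, just as in the walkthrough above.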
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The limits of
parallel segmentation: towards multiple progressive alignment</span></strong><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b><br /></b>
Parallel segmentation has served us well. It has allowed us to remove the base
text from the apparatus output completely: on our publications now, you do not
see the base text at all. We still use a base text when we collate, but its
function now is purely to identify the variants present at each point, and we
customarily optimize it for that purpose (for example, adding or rearranging
words to improve variant identification). The move to parallel segmentation has
other benefits. We can now identify at any point in any witness just what
witnesses are present at that point; we can compare any two (or more)
witnesses; we can create much richer analyses of stemmatic relations. But we
are still not happy.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">In the ‘Collation
Rationale’ argument I cite the variants on the first four words of line 646 of
the Miller’s Tale (’He was agast so of Nowelys flood’).<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l3 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">He was agast
so 33 witnesses</span></li>
<li class="MsoNormal" style="mso-list: l3 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">He was agast
4 witnesses</span></li>
<li class="MsoNormal" style="mso-list: l3 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">So he was
agast 6 witnesses</span></li>
<li class="MsoNormal" style="mso-list: l3 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">He was so
agast 7 witnesses</span></li>
<li class="MsoNormal" style="mso-list: l3 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">He was agast and feerd 2 witnesses<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l3 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">So was he
agast 1 witness</span></li>
</ol>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Just as a presentation of
the variation at this point, this is quite efficient. But as a representation
of the exact linkages between the witnesses, it is rather inefficient. These
six variants are presented in simple parallel, as if no two of them are any
closer than any other. But manifestly, that is not true. The second and fourth
readings ‘He was agast’ and ‘He was so agast’ are much closer to the first
reading ‘He was agast so’ than they are to either the third or the sixth reading.
In turn, the third and sixth readings ‘So he was agast’ and ‘So was he agast’
are much nearer to each other than they are to the other readings.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">With parallel segmentation,
once it has found the segments, the collation stops and just presents the
segments it has found. In this collation system, all variants at any point are
equally unlike. We require some system of grouping the variants within each
segment. For this example, I proposed that the six variants here should be
grouped into two variant sequences:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l1 level1 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">46 witnesses: made up of variants 1, 2, 4 and
5, all beginning with the words ‘He was…’<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l1 level1 lfo2; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">7 witnesses: made up of variants 3 and
6, both beginning with ‘So’<o:p></o:p></span></li>
</ol>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">We can break up the first
group still further:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">40 witnesses, made up of variants 1 and 4,
having the same words but with ’so agast’ transposed<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l2 level1 lfo3; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">6 witnesses, made up of variants 2 and 5,
both omitting ’so’<o:p></o:p></span></li>
</ol>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Finally, we note that the
two groups 1 and 2 (of 46 and 7) are linked together via variant 1 (from group
1) and variant 3 (from group 2): these differ only in their placement of the
word ’so’. </span><span lang="IT">We can represent this schematically as follows:</span><br />
<span lang="IT"><br /></span>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi9ErH_myMV80nEb0Wip3Uhkvy6dVF5mM1EVAQMq8K2BsFRQUmGs6pNf_dlzcWuYGrkjI9W-NqMsUZEO6FDnvZkuGybhfDJxqpKvYR_knveYzyCk4tDud4K4g_kfq_shh-gbheMhTk0A/s1600/agasteg.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi9ErH_myMV80nEb0Wip3Uhkvy6dVF5mM1EVAQMq8K2BsFRQUmGs6pNf_dlzcWuYGrkjI9W-NqMsUZEO6FDnvZkuGybhfDJxqpKvYR_knveYzyCk4tDud4K4g_kfq_shh-gbheMhTk0A/s1600/agasteg.jpg" height="124" width="320" /></a></div>
<span lang="IT">
</span><br />
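The grouping argued for here can be sketched in a few lines of Python (the grouping criterion and all names are illustrative assumptions, not part of any existing tool): readings are clustered by their opening word, exactly the criterion that separates the ‘He was…’ sequence from the ‘So…’ sequence above.

```python
from difflib import SequenceMatcher

# The six variant readings of 'He was agast so', as listed above.
readings = [
    "He was agast so",         # 1: 33 witnesses
    "He was agast",            # 2: 4 witnesses
    "So he was agast",         # 3: 6 witnesses
    "He was so agast",         # 4: 7 witnesses
    "He was agast and feerd",  # 5: 2 witnesses
    "So was he agast",         # 6: 1 witness
]

def similarity(a, b):
    """Proportion of shared word-tokens between two readings (0.0-1.0)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

# Group readings by their opening word: the criterion used above to
# separate the 'He was...' sequence from the 'So...' sequence.
groups = []
for reading in readings:
    for group in groups:
        if group[0].split()[0].lower() == reading.split()[0].lower():
            group.append(reading)
            break
    else:
        groups.append([reading])

for group in groups:
    print(group)
```

Within each group, a token-similarity score of this kind could then drive the finer subdivisions (readings 1 and 4, for instance, score 0.75 against each other, higher than either scores against the ‘So…’ readings).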
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">From examination of the
variant map, we can see that — rather remarkably (or not!) — this representation
mirrors the textual history of the tradition. The original reading is likely to
have been variant 1 (33 mss). Three variants descended directly from variant 1:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Variant 3 (6 witnesses)
by transposition of ’so’, from which a further variant (variant 6, one witness)
develops, by transposition of ‘he was’<br />
Variant 4 (7 witnesses) by transposition of ‘agast so’<br />
The ancestor of variants 2 and 5: both omit the ’so’, while 5 adds ‘and
feerd’.<o:p></o:p></span><br />
<br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Indeed, this distribution
is consistent with other groupings established by our analysis.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">So, here is the challenge I
set in the ‘Collation Rationale’ article, here set out in more detail:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l0 level1 lfo4; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Identify relationships between the variant
groups found by parallel segmentation<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo4; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Work out a way of storing the information
about these relationships, so as to enable different kinds of output<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo4; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Work out the best ways of expressing this
information, in some kind of hierarchical or layered form.<o:p></o:p></span></li>
</ol>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">At present, we do not have
any means of formally expressing the relationships between the variant groups
found by parallel segmentation. Here is a draft of how it might be done, using
the example above:<o:p></o:p></span><br />
<table border="0" cellpadding="0" class="MsoNormalTable" style="mso-cellspacing: 1.5pt;">
<tbody>
<tr style="mso-yfti-firstrow: yes; mso-yfti-irow: 0;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">He was agast so</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
</tr>
<tr style="mso-yfti-irow: 1;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">He was so agast</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
</tr>
<tr style="mso-yfti-irow: 2;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">He was agast</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
</tr>
<tr style="mso-yfti-irow: 3;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">He was
agast and feerd<o:p></o:p></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
</tr>
<tr style="mso-yfti-irow: 4;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">So he was agast</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
</tr>
<tr style="mso-yfti-irow: 5; mso-yfti-lastrow: yes;">
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<span lang="IT">So was he agast</span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="IT" style="font-size: 10.0pt;"><o:p><br /></o:p></span></div>
</td>
<td style="padding: .75pt .75pt .75pt .75pt;">
<div class="MsoNormal">
<br /></div>
</td>
</tr>
</tbody></table>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">We need an adaptation of
the system used by the TEI to hold this. </span><span lang="IT">Ideas please!</span><br />
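One possible starting point for that adaptation: TEI P5 already allows <rdgGrp> elements to nest inside <app> and inside one another. A hedged sketch of the grouping above in that idiom (the wit values are invented placeholders, standing for the witness groups rather than real sigla):

```xml
<!-- Sketch only: nested rdgGrp elements carrying the two variant
     sequences and their subdivisions; wit values are placeholders. -->
<app>
  <rdgGrp><!-- 46 witnesses: readings beginning 'He was' -->
    <rdgGrp><!-- 40 witnesses: 'so agast' transposed -->
      <rdg wit="#g1">He was agast so</rdg>
      <rdg wit="#g4">He was so agast</rdg>
    </rdgGrp>
    <rdgGrp><!-- 6 witnesses: 'so' omitted -->
      <rdg wit="#g2">He was agast</rdg>
      <rdg wit="#g5">He was agast and feerd</rdg>
    </rdgGrp>
  </rdgGrp>
  <rdgGrp><!-- 7 witnesses: readings beginning 'So' -->
    <rdg wit="#g3">So he was agast</rdg>
    <rdg wit="#g6">So was he agast</rdg>
  </rdgGrp>
</app>
```

What nested <rdgGrp> elements cannot yet express is the cross-group link noted above, between variant 1 and variant 3: that is exactly where an adaptation would be needed.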
<span lang="IT"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Variant
identification</span></strong><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><b><br /></b>
So far, we have aligned the texts, stored the alignment identification, and
then adjusted the alignment information (we hope, through some form of multiple
progressive alignment). But we have not yet identified any variants. Now,
consider again our example from above:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">A
The black cat<br />
B THE BLACK CAT<br />
C The black cat<br />
D The, black cat<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Following parallel
segmentation, we may now ignore the base. We look at the first word, and find
they are aligned as follows:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">A
The<br />
B THE<br />
C The<br />
D The,<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Are these, or are these
not, variants of each other? I propose that Collate3 have, for each witness, a
specifications object. This will state, for each witness, whether differences of
case, XML encoding, and punctuation are to be treated as variants or not.
Presume that we direct that case differences and XML encoding are not variants
but that punctuation is. We would get the following collation, taking A as the
base:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The
] A B C; The, D<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Taking B as the base: the
variant would appear as<o:p></o:p></span><br />
<div class="MsoNormal">
<span lang="IT">THE ] A B C; The, D</span></div>
<div class="MsoNormal">
<span lang="IT"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Or, if we say that
punctuation is not significant, but XML encoding is significant, we will get
this collation:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div class="MsoNormal">
<span lang="IT">The ] A B D; The C</span></div>
<div class="MsoNormal">
<span lang="IT"><br /></span></div>
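A minimal sketch of such a specifications object in Python (all names here are my own illustrative inventions, not an existing API): each witness's specification normalizes away whatever it is directed to ignore, and a word counts as a variant only if it still differs after normalization.

```python
import re
import string

class WitnessSpec:
    """Per-witness collation specification: which kinds of difference
    count as variants (hypothetical names, sketching the proposal)."""
    def __init__(self, collate_case=False, collate_punctuation=True,
                 collate_xml=False):
        self.collate_case = collate_case
        self.collate_punctuation = collate_punctuation
        self.collate_xml = collate_xml

    def normalize(self, word):
        w = word
        if not self.collate_xml:
            w = re.sub(r"<[^>]+>", "", w)   # strip XML tags entirely
        if not self.collate_punctuation:
            w = w.translate(str.maketrans("", "", string.punctuation))
        if not self.collate_case:
            w = w.lower()
        return w

def is_variant(base_word, wit_word, spec):
    """A witness word is a variant if it differs from the base after
    normalizing away whatever this witness's spec says to ignore."""
    return spec.normalize(base_word) != spec.normalize(wit_word)

# Case and XML ignored, punctuation significant (the first example above):
spec = WitnessSpec(collate_case=False, collate_punctuation=True,
                   collate_xml=False)
print(is_variant("The", "THE", spec))    # case-only difference: ignored
print(is_variant("The", "The,", spec))   # punctuation difference: a variant
```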
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Variant
identification and the return of the base</span></strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><o:p></o:p></span><br />
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></strong>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">I said above we may now
discard the base. Up to a point, Lord Copper (esoteric joke: see Evelyn Waugh’s
Scoop). There is one critical operation for which we still must retain the base.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The question is to do with
the use of variant specifications to identify exactly what is a variant.
Suppose for our pair A and D we have the variants THE (A) and The, (D). </span><span lang="IT">We have the following variant specifications:</span><br />
<span lang="IT"><br /></span>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">A: ignore
case and punctuation<br />
D: ignore case but do not ignore punctuation<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">We now compare A and D:
‘THE’ and ‘The,’. From the point of view of A: there is no variant here,
because we are ignoring both case and punctuation. But from the point of view
of D: there is a variant, because there is a difference of punctuation.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Thus: the variation found
changes, according to the point of view. It changes too (obviously) according
to which witnesses we are comparing. The only way I can see out of this is to
use the base as the measure against which variants are identified, but always
do the variant identification using the specifications set for the witness. In
this case, presume that the base here is ‘The’, with all witnesses set to
ignore case but not punctuation. We will then have:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The
] A B C; The, D<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Notice that depending on
the base text and the collation specifications, we could get very different
results. Suppose that we set punctuation to be ignored in A B C but not D. If
we use ‘The’ as the base text, we get this:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The
] A B C; The, D<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">But if we set ‘The,’ as the
base, we get this:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The,
] A B C D<o:p></o:p></span></div>
<div style="margin-left: 30.0pt;">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
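The dependence on the choice of base can be sketched as follows (again with invented names): each witness is compared with the base under its own specification, and swapping the base changes the apparatus exactly as described above.

```python
import string

def normalize(word, ignore_punctuation):
    """Lower-case always (case ignored for all witnesses here); strip
    punctuation only if this witness's spec says to ignore it."""
    w = word.lower()
    if ignore_punctuation:
        w = w.translate(str.maketrans("", "", string.punctuation))
    return w

# Witness readings and per-witness specs: punctuation ignored in A B C,
# significant in D (the example above).
witnesses = {"A": "The", "B": "THE", "C": "The", "D": "The,"}
ignore_punct = {"A": True, "B": True, "C": True, "D": False}

def apparatus(base):
    """Sigla agreeing with the base vs. sigla reading a variant,
    judged under each witness's own specification."""
    agree, vary = [], []
    for sigil, word in witnesses.items():
        same = (normalize(word, ignore_punct[sigil])
                == normalize(base, ignore_punct[sigil]))
        (agree if same else vary).append(sigil)
    return agree, vary

print(apparatus("The"))    # D's spec keeps the comma, so D varies
print(apparatus("The,"))   # under each witness's own spec, all agree
```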
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">I don’t see any way around this.
One could avoid this (as Collate0-2 did) by insisting that all witnesses have
the same collation specifications. But it has been forcefully represented to me
that it would be very useful to be able to specify different treatments of
case/punctuation/xml for different witnesses. </span><span lang="IT">So we will
do this.</span><br />
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal">
<i style="mso-bidi-font-style: normal;"><span lang="IT">February
6, 2007<o:p></o:p></span></i></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;"><span lang="IT">Datastructures
for CollateXML<o:p></o:p></span></b></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;"><span lang="IT"><br /></span></b></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Filed
under: CollateXML datastructures — Peter @ 5:46 am <o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">From the account of the
collation, we are dealing with something very different from ’string comparison’.
Indeed, the base unit of the collation is the word: we collate words, not
strings. Words may be concatenated, or divided: but words are the basis of it
all. (This was the form used by Collate).<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">For each witness, we need
the following information:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">Its sigil</span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Its location (in Collate0-2 this was
simply a file name; in CollateXML it might be a url, an xquery or xpath
expression, etc)<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Collation specifications for this witness.
</span><span lang="IT">See below.</span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">For each collateable block: two
collateable object arrays. </span><span lang="IT">See below</span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">For each collateable block: an array of
correspondences with the base. </span><span lang="IT">See below.</span></li>
</ol>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The collation
specifications for variant identification</span></strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><o:p></o:p></span><br />
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></strong>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">These will control what is
recorded as a variant against the base. Settings include:</span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br />
a. case. Settings will be collate/ignore.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">if collate: Collation will
treat differences of case as variants.<br />
if ignore: Collation will not treat differences of case as variants.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">b. xml. Settings will be
all/none/nominated.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">If none: all xml encoding
surrounding, within or between the words will be ignored<br />
If all: all xml encoding will be collated, including empty elements,
surrounding, within, and between the words<br />
If nominated: only the nominated xml elements will be collated. The details of the
xml elements to be collated will be held in a further structure (see below).<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">c. xmlcollate: null unless
xml=nominated. This structure is a series of elements to be collated, as
follows:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">i. gi: the gi of the
element to be collated (including namespace)<br />
ii. attributes: Values are all/none/nominated. If all: all attributes and their
values are to be collated; if none, all attribute values are ignored, and only
element names are collated; if nominated, details of attributes to be collated
are held in a further structure<br />
iii. collateattributes: null unless attributes=nominated. This structure is a
series of attribute names which will be collated for this element (this could
be further elaborated, perhaps, to set conditions: report as variant if the
attribute is a particular value)<o:p></o:p></span><br />
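Taken together, settings (a)-(d) might be held in a structure like the following (a sketch only; the field names follow the labels above but are otherwise my own):

```python
# Hedged sketch of a per-witness collation specification, with a
# nominated-XML list as described in (c). All names are illustrative.
spec = {
    "case": "ignore",            # a. collate / ignore
    "xml": "nominated",          # b. all / none / nominated
    "xmlcollate": [              # c. present only when xml == "nominated"
        {"gi": "hi", "attributes": "none", "collateattributes": None},
        {"gi": "add", "attributes": "nominated",
         "collateattributes": ["place"]},
    ],
    "punctuation": "all",        # d. all / none / nominated
}

# An element participates in collation only if its gi is nominated:
nominated_gis = {element["gi"] for element in spec["xmlcollate"]}
print("hi" in nominated_gis, "del" in nominated_gis)
```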
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">d. punctuation. Settings
will be all/none/nominated<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">if all: collate all
punctuation, as identified by the isPunctuation method<br />
if none: collate no punctuation, as identified by the isPunctuation method<br />
if nominated: collate only specific punctuation identified by the isPunctuation
method<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The specifications object
must also have at least one method: isPunctuation. For a particular pair of
strings, this should identify whether differences between them are purely
punctuation (in which case, they might or might not be variants) or not.</span><br />
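A sketch of what isPunctuation might look like, assuming Python's standard notion of punctuation characters stands in for whatever character class the real method would use:

```python
import string

def is_punctuation_difference(a, b):
    """True if two word-forms differ, but only in punctuation
    characters: a sketch of the isPunctuation method proposed above."""
    strip = str.maketrans("", "", string.punctuation)
    return a != b and a.translate(strip) == b.translate(strip)

print(is_punctuation_difference("The", "The,"))  # purely punctuation
print(is_punctuation_difference("The", "THE"))   # case, not punctuation
```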
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br />
Two other methods might be required:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">isCaseDifference: if it is
found that Java’s native methods for ignoring case difference when comparing
strings are not adequate.<br />
adjustXML: for some contexts, we may need to do more than simply ignore/not
ignore XML. </span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Consider:<br />
ex&per;perience<br />
One might here wish to ignore the &per; entity and treat this as
‘experience’.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The collation
specifications for text alignment </span></strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><o:p></o:p></span><br />
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></strong>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The model here proposed, of
separating text alignment from variant identification, presumes that optimal
text alignment would be achieved by ignoring differences of case, punctuation
and xml. Thus, at the alignment stage, we would use the minimal set of
collation specifications for comparison of witnesses with the base.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Hierarchical setting
of collation specifications</span></strong><b><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br />
</span></b><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br />
One would expect that for most collations, one would have identical
specifications for all witnesses. In programming terms: one would set the
specifications for the class of witnesses, which would then inherit a uniform
set of specifications. This design permits that the uniform specification would
be overruled for specific witnesses.<o:p></o:p></span><br />
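In Python terms, this hierarchical setting might be sketched like this (illustrative names): class-level defaults give every witness the uniform specification, and an individual witness overrides only the settings that differ.

```python
class CollationSpec:
    """Uniform defaults set once for the whole collation; individual
    witnesses override only what differs (names are illustrative)."""
    defaults = {"case": "ignore", "punctuation": "collate", "xml": "ignore"}

    def __init__(self, **overrides):
        # Per-witness overrides are layered over the class-wide defaults.
        self.settings = {**self.defaults, **overrides}

uniform = CollationSpec()                    # inherits all defaults
witness_d = CollationSpec(xml="nominated")   # overrides xml only
print(uniform.settings)
print(witness_d.settings)
```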
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;">The collateable
object arrays</span></strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><o:p></o:p></span><br />
<strong><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></strong>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The key to Collate0-2 was
that it did not collate text strings: it collated word objects. For each
witness, it held the words of the text in an array of word objects, numbered
from 0 to xxx, and all collation took place against these word objects, with
information about variants found stored in tables of numbers referring to these
arrays. I propose that CollateXML retain, refine and extend this model.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Collate0-2 accepted ‘plain
text’ and converted this to word object arrays as it collated. As it did so, it
might remove (depending on various settings) punctuation or other characters
from the text to be collated. Thus ‘april / that’ would become:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">word 1: April<br />
word 2: that<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Notice that the ‘/’ is here
removed. At a later point, Collate0-2 converted the text to<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><w
n="1">April</w> / <w n="2">that</w><o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This is rather
unsatisfactory. The relationship between the numbering of the words in the word
object array and that in the converted XML depends on rather fragile
assumptions about what is and is not a word. I propose instead that CollateXML
recommend that for word-by-word collation, input must be in full XML form, with
all discrete elements marked as follows:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><w n="1">April</w> <w n="2">/</w> <w n="3">that</w><o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This has several
implications. It means that, because of the problem of overlapping hierarchies,
treatment of elements spanning across words has to be as follows:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><w n="1"><hi>April</hi></w> <w n="2"><hi>/</hi></w> <w n="3"><hi>th</hi>at</w><o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">not<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><hi><w n="1">April</w> <w n="2">/</w> <w n="3">th</hi>at</w><o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The advantage of the
explicit labelling of every collateable object in the original text as a
<w> element with an ‘n’ attribute is that it makes linking of the
collation with the original text absolutely explicit. The ‘n’ attribute on each
<w> element can be used to denote each word in the collateable object array,
and then used to link to the corresponding <w> element in the original. (One
might — might — use XPath to achieve the same result: that is a matter for
discussion.)<o:p></o:p></span><br />
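As a rough sketch of how such input might be generated from plain text (the function name and the simple tokenization rule here are my own, for illustration only; the real rules would need to be configurable):

```python
import re

def to_w_elements(text):
    """Wrap every token -- words AND punctuation alike -- in a numbered
    <w> element, so each collateable object is explicitly marked.
    A sketch of the proposed input form, not the definitive tokenizer."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(
        '<w n="{}">{}</w>'.format(i + 1, tok) for i, tok in enumerate(tokens)
    )

print(to_w_elements("April / that"))
# <w n="1">April</w> <w n="2">/</w> <w n="3">that</w>
```

Because the ‘/’ becomes its own numbered `<w>` element rather than being silently dropped, the numbering in the collateable object array and in the XML can never drift apart.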
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">I said we need TWO
collateable object arrays for each witness. The first array, as specified
above, holds the original text: call this textOriginal. But this is not the
text which will actually be collated. That is held in the second array: call
this textCollateable. </span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">TextCollateable
will have identical structure and initially identical content to textOriginal.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">The reason for the two
arrays is to make regularization possible. Regularization was one of the great
strengths of Collate0-2, and the approach here suggested is based closely on
how Collate0-2 worked. As the scholar collates, he or she will see cases where
it is necessary to filter out spelling or other non-significant variation. This
may involve alteration of word division. Thus, we might be collating:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">base: the man Cat<br />
wit1: theman cat<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">It appears that in wit1 one
will want to change the word division for ‘theman’ and regularize ‘cat’ to
‘Cat’. Thus, textOriginal would hold for wit1:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">word1 theman<br />
word2 cat<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">while textCollateable must
be altered to:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">word1 the<br />
word2 man<br />
word3 Cat<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Notice that this will mean
keeping an offset pointer at each word, indicating for each array which word in
the other array corresponds to it.<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Putting this together, we
require the following information for each word object in each collateable
object array:<o:p></o:p></span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">the word itself (including XML encoding)<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">the n number for the word, to relate to
the n number on the corresponding <w> element in the original<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">the offset to the corresponding word in
the other array. Thus: for word 1 in textCollateable the offset would be 0;
for word 2 and word 3 it would be -1. For word 1 in textOriginal the
offset would be 0; for word 2 it would be +1.<o:p></o:p></span></li>
</ol>
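Putting the three fields together, the two arrays for wit1 in the ‘theman cat’ example above might look like this (a minimal sketch; the field names and the `original_for` helper are my own, for illustration):

```python
# Each word object carries: the word itself, its n number linking back to the
# <w n="..."> element, and the offset to the corresponding word in the other array.

textOriginal = [
    {"n": 1, "word": "theman", "offset": 0},   # -> word 1 in textCollateable
    {"n": 2, "word": "cat",    "offset": +1},  # -> word 3 in textCollateable
]

textCollateable = [
    {"n": 1, "word": "the", "offset": 0},   # -> word 1 in textOriginal
    {"n": 2, "word": "man", "offset": -1},  # -> word 1 in textOriginal
    {"n": 3, "word": "Cat", "offset": -1},  # -> word 2 in textOriginal
]

def original_for(i):
    """0-based index in textOriginal of the word behind textCollateable[i]."""
    return i + textCollateable[i]["offset"]

# 'the' and 'man' both point back to 'theman'; 'Cat' points back to 'cat'.
assert [original_for(i) for i in range(3)] == [0, 0, 1]
```

The offsets make regularization reversible: however the scholar re-divides or respells the collateable text, every collated word can still be traced back to its exact place in the original witness.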
<div class="MsoNormal">
<i style="mso-bidi-font-style: normal;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">June 21, 2007</span></i></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">Goodbye CollateXML, hello CollateX<o:p></o:p></span></b></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></b></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Filed
under: Introduction — Peter @ 1:32 pm <o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">At last, we are moving.
Today I began setting up the SourceForge site that will take all the code as
we start work on the program. And in the process, I did what I have been
planning to do for some time: change the name of the program from CollateXML to
CollateX. Those who have read all the postings on this (of course, all of you)
will know the reason for this change. We plan that the program should be able
to collate texts in any format whatever, by devising a single canonical input
form and then having translators into this canonical form. Thus it will be able
to collate XML, sure: but it will also be able to collate many other formats,
including indeed old-style Collate 1-3 files.</span><br />
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<i style="mso-bidi-font-style: normal;"><span lang="IT">June 28,
2007<o:p></o:p></span></i></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;"><span lang="IT">Collate
examples, and first task<o:p></o:p></span></b></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;"><span lang="IT"><br /></span></b></div>
<div class="MsoNormal">
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Filed
under: Collate — Peter @ 10:02 pm <o:p></o:p></span></div>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">We have decided that the
logical place to start is to define the data structures for the common input
phase. So Andrew will get on today with working these out. Here are a few
example sets for him to chew on:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">1:<br />Base the black cat<br />
A The black cat<br />
B THE BLACK CAT<br />
C The black cat<br />
D The, black cat<o:p></o:p></span><br />
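For set 1, one plausible regularization (my own sketch; the actual rules would be scholar-configurable) folds case and drops punctuation, after which all five texts collate as identical:

```python
import re

def regularize(text):
    """Fold case and keep only word tokens, dropping punctuation.
    One possible regularization for example set 1."""
    return " ".join(re.findall(r"\w+", text.lower()))

witnesses = ["the black cat", "The black cat", "THE BLACK CAT",
             "The black cat", "The, black cat"]

# Base and witnesses A-D all reduce to the same regularized form.
assert len({regularize(w) for w in witnesses}) == 1
```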
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">2:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">Base the white and black
cat<br />
A The black cat<br />
B the black and white cat<br />
C the black and green cat<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">3:<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">This Carpenter hadde wedded
newe a wyf<br />
This Carpenter hadde wedded a newe wyf<br />
This Carpenter hadde newe wedded a wyf<br />
This Carpenter hadde wedded newly a wyf<br />
This Carpenter hadde E wedded newe a wyf<br />
This Carpenter hadde newli wedded a wyf<br />
This Carpenter hadde wedded a wyf<o:p></o:p></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;"><br /></span>
<span lang="IT">4.</span><br />
<ol start="1" type="1">
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">He was agast so (33 witnesses)</span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">He was agast (4 witnesses)</span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">So he was agast (6 witnesses)</span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">He was so agast (7 witnesses)</span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="EN-GB" style="mso-ansi-language: EN-GB;">He was agast and feerd (2 witnesses)<o:p></o:p></span></li>
<li class="MsoNormal" style="mso-list: l0 level1 lfo1; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto; tab-stops: list 36.0pt;"><span lang="IT">So was he agast (1 witness)</span></li>
</ol>
<span lang="IT">5. Time for some XML:</span><br />
<span lang="IT"><l id="MI-35-El" n="35">&paraph; <w n="1">This</w> <w n="2">Carpente&rtail;</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l><br />
<l id="MI-35-Ii" n="35"><w n="1">This</w> <w n="2">Carpenter</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">a</w> <w n="6">newe</w> <w n="7">wi&ftail;</w> </l><br />
<l id="MI-35-Cn" n="35"><w n="1">This</w> <w n="2">Carpenter</w> <w n="3">had</w> <w n="4">newe</w> <w n="5">wedde&dtail;</w> <w n="6">awif</w> </l><br />
<l id="MI-35-Cp" n="35"><w n="1">This</w> <w n="2">Carpunter</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">a</w> <w n="6">newe</w> <w n="7">wy&ftail;</w> </l><br />
<l id="MI-35-Hg" n="35">&paraph; <w n="1">This</w> <w n="2">Carpenter</w> &virgule; <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l><br />
<l id="MI-35-Gg" n="35"><w n="1">This</w> <w n="2">carpenter</w> <w n="3">hadde</w> <w n="4">weddid</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l></span><br />
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">6. Some more XML, this time
with more encoding:<o:p></o:p></span><br />
<span lang="IT"><l id="MI-1-Bo1" n="1"><w n="1"><hi rend="unex" ht="3">w</hi><hi rend="ul">Hilom</hi></w> <w n="2">ther</w> <w n="3">was</w> <w n="4">duelling</w> <w n="5">in</w> <w n="6">Oxenford</w> <note<br />
<l id="MI-1-Cx1" n="1"><w n="1"><hi ht="2" rend="other">W</hi>Hilom</w> <w n="2">therwas</w> <w n="3">dwellyn&gtail;</w> <w n="4">in</w> <w n="5">Oxenforde</w> </l><br />
<l id="MI-1-Bw" n="1"><w n="1"><hi ht="5" rend="orncp">W</hi>ylom</w> <w n="2">þer</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">in</w> <w n="6">Oxenford</w> </l><br />
<l id="MI-1-Ch" n="1"><w n="1"><hi ht="2" rend="orncp">W</hi>hilom</w> <w n="2">ther</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">at</w> <w n="6">Oxenforde</w> </l><br />
<l id="MI-1-Dd" n="1"><w n="1"><hi ht="5" rend="orncp">W</hi>hilom</w> <w n="2">there</w> <w n="3">was</w> <w n="4">dwellyng</w> &virgule; <w n="5">in</w> <w n="6">Oxenfor&dtail;</w> </l><br />
<l id="MI-80-Bo2" n="80"><w n="1">As</w> <w n="2">brode</w> <w n="3">as</w> <w n="4">is</w> <w n="5">þe</w> <w n="6">boos</w> <w n="7">of</w> <w n="8">a</w> <w n="9">bokelyr</w> </l><br />
<l id="MI-1-Ad3" n="1"><w n="1"><hi ht="4" rend="orncp">W</hi>hilom</w> <w n="2">ther</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">in</w> <w n="6">Oxenford</w> </l><br />
<l id="MI-1-Cp" n="1"><w n="1"><hi ht="6" rend="orncp">W</hi>hilom</w> <w n="2">þer</w> <w n="3">was</w> <w n="4">dwellyn&gtail;</w> <w n="5">at</w> <w n="6">Oxenfoor&dtail;</w> </l></span><br />
<span lang="IT"><br /></span>
<span lang="EN-GB" style="mso-ansi-language: EN-GB;">enough, now!<o:p></o:p></span><br />
<br />
<h4>
Email announcing these posts</h4>
<div>
Sent 6 February 2007 to Joris, Fotis, and other participants in the January 2007 meeting in The Hague:</div>
<div>
<br /></div>
<div>
<div class="p1">
Dear everyone</div>
<div class="p1">
at the meeting a week last Friday, we had a good deal of discussion about the future of Collate. Since the meeting, very much under the influence of Barbara, I have undergone a massive conversion.</div>
<div class="p2">
<br /></div>
<div class="p1">
Barbara pointed out that at the meeting, I seemed to be resisting the idea of handing over Collate for others to develop. Indeed I was, and as Barbara pointed out: I was doing exactly what I forever denounce other people for doing: saying 'this is mine! hands off!!'.</div>
<div class="p2">
<br /></div>
<div class="p1">
So now, I have seen the light. And indeed, the more I think of it: this is an ideal project for us all to collaborate on. And I was really impressed with the enthusiasm in The Hague for doing this together. So here is my suggestion: we develop the next Collate (which I suggest should be called CollateXML) together. To start off the process, I have created a blog</div>
<div class="p3">
<span class="s1"><a href="http://www.sd-editions.com/blog">http://www.sd-editions.com/blog</a></span></div>
<div class="p1">
Here you will find a whole series of materials now about Collate, thus:</div>
<div class="p1">
An introduction</div>
<div class="p1">
A history of the three earlier versions of Collate</div>
<div class="p1">
A design outline for CollateXML</div>
<div class="p1">
How CollateXML should work</div>
<div class="p1">
Some Datastructures needed for Collate</div>
<div class="p2">
<br /></div>
<div class="p1">
There will be more to come, but this will get us running! I will put up some of the code for Collate2, though I doubt this will be as useful as the explanations given on the blog site.</div>
<div class="p2">
<br /></div>
<div class="p1">
I'd be glad to hand this all over to you folks. I'm sure happy to help out, and provide test and sample files, etc etc, and offer lots more advice where I felt it might help. I'd say we could set this up as a SourceForge project and, well, get on with it. One place to start would be with implementing the fundamental word-by-word collation algorithm set out in the 'How CollateXML should work' post.
<div class="p2">
<br /></div>
<div class="p1">
Well, I've thrown out the stone into the pond now. So folks, who is ready to give up a few years of their life getting this to run???</div>
<div class="p1">
all the best</div>
<div class="p1">
Peter</div>
</div>
<!--EndFragment--></div>
PeterRobinsonhttp://www.blogger.com/profile/11407068137474574132noreply@blogger.com0tag:blogger.com,1999:blog-5774054219585481589.post-86551537251458912602013-07-29T17:03:00.001-07:002013-07-31T05:21:56.656-07:00Why digital humanists should get out of textual scholarship. And if they don't, why we textual scholars should throw them out.Three weeks ago, when I was writing my paper for the conference on Social, Digital, Scholarly Editing I organized (with lots of help -- thanks guys!) at Saskatoon, I found myself writing this sentence:<br />
<br />
<span style="font-family: Cambria; font-size: x-small;">Digital humanists should get out of textual scholarship: and if they will not, textual scholars should throw them out.</span><br />
<span style="font-family: Cambria; font-size: 12pt;"><br /></span>
<!--[if gte mso 9]><xml>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 5"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 6"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 6"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 6"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 6"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 6"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 6"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 6"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 6"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 6"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 6"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 6"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 6"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 6"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 6"/>
<w:LsdException Locked="false" Priority="19" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Emphasis"/>
<w:LsdException Locked="false" Priority="21" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Emphasis"/>
<w:LsdException Locked="false" Priority="31" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Reference"/>
<w:LsdException Locked="false" Priority="32" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Reference"/>
<w:LsdException Locked="false" Priority="33" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Book Title"/>
<w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
<w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
</w:LatentStyles>
</xml><![endif]-->
As I wrote it, I thought: is that saying more than I mean? Perhaps -- but it seemed to me something worth saying, so I left it in, and said it, and there was lots of discussion, and some misunderstanding too. Rather encouraged by this, I said it again the next week, this time at the ADHO conference in Nebraska. (The whole SDSE paper, as delivered, is now on <a href="http://www.academia.edu/4124828/SDSE_2013_why_digital_humanists_should_get_out_of_textual_scholarship">academia.edu</a>; the slides from Nebraska are at <a href="http://www.slideshare.net/PeterRobinson10/peter-robinson-24420126">slideshare</a> -- as the second half of a paper, following my timeless thoughts on what a scholarly digital edition should be).<br />
<br />
So, here are a few more thoughts on the relationship between the digital humanities and textual scholarship, following the discussion which the two papers provoked, and numerous conversations with various folk along the way.<br />
<br />
First, what I most definitely did not mean. I do not propose that textual scholars should reject the digital world, go back to print, it's all been a horrible mistake, etc. Quite the reverse, in fact. Textual scholarship is about communication and, even more than other disciplines, must change as how we communicate changes. The core of my argument is that the digital turn is so crucial to textual scholars that we have to absorb it totally -- we have to be completely aware of all its implications for what we do, how we do it, and who we are. We have to do this, ourselves. We cannot delegate this responsibility to anyone else. To me, too much of what has gone on at the interface between textual scholarship and the digital humanities over the last few years has been exactly this delegation. There were good reasons for this delegation over the first two decades (roughly) of the making of digital editions. The technology was raw; digital humanists and scholarly editors had to discover what could be done and how to do it. The prevailing model for this engagement was: one scholar, one project, one digital humanist, hereafter 1S/1P/1DH. Of course there are many variants on this basic pattern. Often, the one digital humanist was a team of digital humanists, typically working out of a single centre, or was a fraction of a person, or the one scholar might be a group of scholars. But, the basic pattern of close, long and intensive collaboration between the 'one scholar' and the 'one digital humanist' persists. This is how I worked with Prue Shaw on the <i><a href="http://www.sd-editions.com/Commedia/index.html">Commedia</a> </i>and <i><a href="http://www.sd-editions.com/Monarchia/index.html">Monarchia</a></i>; with Chris Given-Wilson and his team on the <i><a href="http://www.sd-editions.com/PROME/index.html">Parliament Rolls of Medieval England</a></i>; and how many other digital editions were made at King's College London, MITH, IATH, and elsewhere.<br />
<br />
This leads me to my second point, which has (perhaps) been even more misunderstood than the first point. I am not saying, at all, that because of the work done up to now, all the problems have been solved, we have all the tools we need, we can now cut the ties between textual scholarship and digital humanities and sail the textual scholarly ship off into the sunset, unburdened by all those pesky computer folk. I am saying that this mode of collaboration between textual scholars and digital humanists, as described in the last paragraph, has served its purpose. It did produce wonderful things, it did lead to a thorough understanding of the medium and what could be done with it. However, there are such problems with this model that it is not just that it is no longer needed: we should abandon it for all but a very few cases. The first danger, as I have suggested, is that it leads to textual scholars relying over-much on their digital humanist partners. I enjoyed, immensely, the privilege of two decades of work with Prue Shaw on her editions of Dante. Yet I feel, in looking back (and I know Prue will agree), that too many times, I said to her -- we should do this, this way; or we cannot do that. I think these would have been better editions if Prue herself had been making more decisions, and me fewer (or even none). As an instance of this: Martin Foys' edition of the Bayeux Tapestry seems to me actually far better as an edition, in terms of its presentation of Martin's arguments about the Tapestry, and his mediation of the Tapestry, than anything else I have published or worked on. And this was because this really is Martin's edition: he conceived it, he was involved in every detail, he thought long and hard about exactly how and what it would communicate. (Of course, Martin did not do all of this himself, and of course he relied heavily on computer engineers and designers -- but he worked directly with them, and not through the filter of a 'digital humanist', i.e., me). 
And the readers agree: ten years on, this is still one of the best-selling of all digital editions.<br />
<br />
The second danger of this model, and one which has already done considerable damage, is that the digital humanist partners in this model come to think that they understand more about textual editing than they actually do -- and, what is worse, the textual editors come to think that the digital humanists know more than they do, too. A rather too-perfect example of this is what is now chapter 11 of the TEI guidelines (in the P5 version). The chapter heading is "Representation of Primary Sources", and the opening paragraphs suggest that the encoding in this chapter is to be used for all primary source materials: manuscripts, printed materials, monumental inscriptions, anything at all. Now, it happens that the encoding described in this chapter was originally devised to serve a very small editorial community, those engaged in the making of "genetic editions", typically of the draft manuscripts of modern authors. In these editions, the single document is all-important, and the editor's role is to present what he or she thinks is happening in the document, in terms of its writing process. In this context, it is quite reasonable to present an encoding optimized for that purpose. But what is not at all reasonable is to presume that this encoding should apply to every kind of primary source. When we transcribe a manuscript of the <i>Commedia</i>, we are not just interested in exactly how the text is disposed on the page and how the scribe changed it: we are interested in the text as an instance of the work we know as the <i>Commedia</i>. Accordingly, for our editions, we must encode not just the "genetic" text of each page: we need to encode the text as being of the <i>Commedia</i>, according to canticle, canto and line. And this is true for the great majority of transcriptions of primary sources: we are encoding not just the document, but the instance of the work also. 
Indeed, it is perfectly possible to encode both document and work instance in the one transcription, and many TEI transcriptions do this. For the TEI to suggest that one should use a model of transcription developed for a small (though important) fraction of editorial contexts for all primary sources, the great majority of which require a different model, is a mistake.<br />
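The dual encoding just described can be sketched in TEI terms. The fragment below is a hypothetical illustration, not drawn from any actual <i>Commedia</i> transcription: the "div" and "l" elements carry the work structure of canticle, canto and line, while the empty milestone elements "pb", "cb" and "lb" record where page, column and line breaks fall in one particular document.

```xml
<!-- Hypothetical TEI sketch: work structure (canticle / canto / line)
     carried by div and l; document structure (page, column and line
     breaks) recorded with the empty milestones pb, cb and lb. -->
<div type="canticle" n="Inferno">
  <div type="canto" n="1">
    <pb n="1r"/><cb n="a"/>
    <l n="1">Nel mezzo del cammin di nostra vita</l>
    <l n="2">mi ritrovai per una selva oscura,</l>
    <l n="3">ché la diritta via era smarrita.</l>
  </div>
</div>
```

A transcription of this shape can answer both kinds of question at once: what stands at Inferno 1.2 as an instance of the work, and what stands in column a of folio 1r of this document.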
<br />
Another instance of this hubris is the preoccupation with TEI encoding as the ground for scholarly editing. Scholarly editors in the digital age must know many, many things. They must know how texts are constructed, both as document and work instance; they must know how they were transmitted, altered, transformed; they must know who the readers are, and how to communicate the texts and what they know of them using all the possibilities of the digital medium. What an editor does not need to know is exactly what TEI encoding should be used at any point, any more than editors in the print age needed to know what variety of Linotype machine was in use. While the TEI hegemony has created a pleasant industry in teaching TEI workshops, the effect has been to mystify the editorial process, convincing rather too many prospective editors that this is just too difficult for them to do without -- guess what -- a digital humanist specialist. This, in turn, has fed what I see as the single most damaging product of the continuation of the 1S/1P/1DH model: that it disenfranchises all those scholars who would make a digital edition, but do not have access to a digital humanist. As this is almost every textual scholar there is, we are left with very few digital editions. This has to change. Indeed, multiple efforts are being made towards this change, as many groups are attempting to make tools which (at least in theory) might empower individual editors. We are not there yet, but we can expect in the next few years a healthy competition as new tools appear. <br />
<br />
A final reason why the 1S/1P/1DH model must die is the most brutal of all: it is just too expensive. A rather small part of the Codex Sinaiticus project, the transcription and alignment of the manuscripts, consumed grant funding of around £275,000; the whole project cost many times more. Few editions can warrant this expenditure -- and as digital editions and editing lose their primary buzz, funding will decrease, not increase. Throw in another factor: almost all editions made this way are data siloes, with the information in them locked up inside their own usually-unique interface, and entirely dependent on the digital humanities partner for continued existence.<br />
<br />
In his <a href="http://brandaen.huygensinstituut.nl/?p=497">post</a> in response to the slides of my Nebraska talk, Joris van Zundert speaks of "comfort zones". The dominance of the 1S/1P/1DH model, and the fortunate streams of funding sustaining that model, have made a large comfort zone. The large digital humanities centres have grown in part because of this model and the money it has brought them -- and have turned the creation of expensively-made data, dependent on them for support, into a rationale for their own continued existence. What is bad for everyone else -- a culture where individual scholars can make digital editions only with extraordinary support -- is good for them, as the only people able to provide that support. I've written elsewhere about the need to move away from the domination of digital humanities by a few large centres (in my contribution to the proceedings of last year's inaugural Australian Association of Digital Humanities conference).<br />
<br />
This comfort zone is already crumbling, as comfort zones tend to do. But besides the defects of the 1S/1P/1DH model, a better reason for its demise is that a better model exists, and we are moving towards it. Under this model, editions in digital form will be made by many people, using a range of online and other tools which will permit them to make high-quality scholarly editions without having to email a digital humanist every two minutes (or ever, even). There will be many such editions. But we will have gained nothing if we lock up these many editions in their own interfaces, as so many of us are now doing, and if we wall up the data behind non-commercial or other restrictive licences. <br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
This is why I am at such pains to emphasize the need for this new generation of editions to adopt the Creative Commons Attribution-ShareAlike licence, and to make all created materials available independent of any one interface, as the third and fourth desiderata I list for scholarly editions in the digital age. The availability of all this data, richly marked up according to TEI rules and supporting many more uses than the 'plain text' (or 'wiki markup') transcripts characteristic of the first phase of online editing tools, will fuel a burgeoning community of developers, hacker/scholars, interface creators, digital explorers of every kind. I expressed this in my Nebraska talk this way:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijDwZYS_s_xHx8JU8sioP6wOn2Q-yZRYhNEmL8PrNdDVmE2MxROqG-XBWLKNJuAFRazRGxCNZnVpBlh0GlvKceTvySHXh9MI0KBm6FVTuCE7ELQFsk-sWYFvHcrk8xYwQNYGxWcl8MzQ/s1600/slide.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijDwZYS_s_xHx8JU8sioP6wOn2Q-yZRYhNEmL8PrNdDVmE2MxROqG-XBWLKNJuAFRazRGxCNZnVpBlh0GlvKceTvySHXh9MI0KBm6FVTuCE7ELQFsk-sWYFvHcrk8xYwQNYGxWcl8MzQ/s1600/slide.jpg" height="240" width="320" /></a></div>
<br />
Under this model we can look to many more digital humanists working with textual scholarly materials, and many more textual scholars using digital tools. There will still be cases where the textual scholar and the digital humanist work closely together, as they have done under the 1S/1P/1DH model, in the few scholarly edition projects which are of such size and importance as to warrant and fund their own digital support. (I hope that Troy Griffitts has a long and happy time ahead of him, supporting the great editions of the Greek New Testament coming from <a href="http://egora.uni-muenster.de/intf/index_en.shtml">Münster</a> and <a href="http://www.birmingham.ac.uk/schools/ptr/departments/theologyandreligion/research/projects/gospel-john.aspx">Birmingham</a>). But these instances will not be the dominant mode of how digital humanists and textual scholars will work together. At heart, the 1S/1P/1DH model is inherently restrictive. Only a few licensed people can work with the data made from any one edition. Instead, as Joris says, we should seek to unlock the "highly intellectually creative and prolific" potential of the digital environment, by allowing everyone to work with what we make. In turn, this will fuel the making of more and better tools, which textual scholars can then use to make more and better editions, in a truly virtuous circle.<br />
<br />
Perhaps I overdramatized matters by using a formula suggesting that digital humanists should no longer have anything to do with textual scholarship, when I meant something different: that the model of how digital humanists work with textual scholars should change -- and is changing. I think it is changing for the better. But to ensure that it does, we should recognize that the change is necessary, work with it rather than against it, and determine just what we would like to see happen. It would help enormously if the large digital humanities centres, and the agencies which fund them, subscribed whole-heartedly to my third and fourth principles: of open data, available through open APIs. The first is a matter of will; the second requires resources, but those resources are not unreasonable. I think that it will be very much in the interests of the centres to adopt these rules. Rather quickly, this will exponentially increase the amount of good data available to others to use, and hence incite others to create more, and in turn increase the real need for centres specializing in making the tools and other resources textual scholars need. So, more textual scholars, more digital humanists, everyone wins.
<h2>The Woodstock of the Web: Geneva 25-27 May, 1994</h2>
<h2>
Background: about this article (27 July 2013)</h2>
In 1994 I was based in Oxford University Computing Services, clinging post-doctorally to academic life through a variety of research posts at the intersection of computing and the humanities (creating the Collate program with Leverhulme support, starting off the Canterbury Tales Project with minimal formal support but lots of enthusiasm, advising Cambridge University Press about the new world of digital publication, etc.). Because I had no regular post, people threw me the crumbs they could not eat themselves ("people" being mostly Susan Hockey, Lou Burnard and Marilyn Deegan). At that time, Lou and Harold Short were writing a report outlining the need for a computing data service for the humanities: this report served as the template for what became the Arts and Humanities Data Service (now alas deceased, killed by British Rover -- that's another story). You might still be able to get a copy of this report <a href="http://books.google.com/books/about/An_Arts_and_Humanities_Data_Service.html?id=fgdgGwAACAAJ">online</a>. They had got some money for travel to research the report, more travel than Lou, for his part, could manage: so several times that year Lou (whose office was beside mine in 6 Banbury Road) said to me, "Peter, how would you like to go to..". So I got to go to Dublin to see <a href="http://davidabel4.blogspot.com/2005/05/plummet-from-grace.html">John Kidd</a> commit public academic hara-kiri at the James Joyce conference. And, I got to go to Geneva for the first 'WWW' conference, ever. Some time in that period -- 1993, it must have been -- Charles Curran called me into his office at OUCS to show me the latest whizzy thing: a system for exchanging documents on the internet, with pictures! and links! This was the infant web, and OUCS had (so legend goes) server no. 13 (Peter's note 2024: this may have been as early as September 1991, just a month after TBL made the very first webpage).<br />
<br />
The total triumph of the web as invented at CERN in those years (that is, http + html, etc.) makes it easy to forget that at that time, it was one among many competing systems, all trying to figure out how to use the internet for information exchange. Famously, a paper proposed by Tim Berners-Lee and others for a conference on hypertext in 1992 was <a href="http://collab.ecs.soton.ac.uk/ht/jodi/html/jodi11b-mc.html">rejected</a>; Ian Ritchie, who ran OWL, which looked just like the web before the Web, has a nice <a href="http://www.ted.com/talks/ian_ritchie_the_day_i_turned_down_tim_berners_lee.html">TED video</a> telling how he was unimpressed by Tim Berners-Lee's idea. So in early 1994, it was not a slam-dunk that the Web would win. I had already met TBL: he came to Oxford, I think in late 1993 or early 1994, with the proposal that Oxford should be a host of what became the W3C (the World Wide Web Consortium). I was present (with Lou, and others) at some of a series of meetings between TBL and various Oxford people. The Oxford reception was not rapturous, and he ended by going to MIT (as this report notes). <br />
<br />
The report is completely un-edited: I've highlighted headings, otherwise the text is exactly as I sent it to -- who? presumably Lou and Harold. And, I still have, and occasionally wear, the conference t-shirt.<br />
<br />
****My trip report****<br />
<h4>
Summary:</h4>
The first World Wide Web conference showed the Web at a moment of transition. <br />
It has grown enormously, on a minimum of formal organization, through the<br />
efforts of thousands of enthusiasts working with two cleverly devised interlocking<br />
systems: the HTML markup language that allows resource to call resource across<br />
networks, and the HTTP protocol that allows computer servers to pass the<br />
resources about the world. However, the continued growth and health of the Web<br />
as a giant free information bazaar will depend on sounder formal organization, and<br />
on its ability to provide efficient gateways to commercial services. The challenge<br />
for the Web is to harmonize this organization and increasing commercialism with<br />
the idealistic energies which have driven it so far, so fast. For the Arts and<br />
Humanities Data Archive, the promise of the Web is that it could provide an ideal<br />
means of network dissemination of its electronic holdings, blending free access with<br />
paid access at many different levels.<br />
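To make the interlocking of the two systems concrete: in the earliest protocol, a browser asked a server for a resource with a one-line request such as "GET /report.html", and received in reply an HTML document, whose anchors in turn name further resources on further servers, fetched by the same means when followed. The page and server below are invented, purely for illustration.

```html
<!-- Invented, minimal page of the period. The browser sends the
     one-line request "GET /report.html" over a network connection;
     the server answers with this document. Each href names another
     resource on some server, fetched the same way when followed. -->
<html>
<head><title>Conference trip report</title></head>
<body>
<h1>Geneva, May 1994</h1>
<p>The conference proceedings are held on the
<a href="http://www.cern.ch/papers/index.html">CERN server</a>.</p>
</body>
</html>
```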
<br />
From Wednesday 25th to Friday 27th I braved the clean streets and expensive<br />
chocolates of Geneva, in search of an answer to a simple question: what is the World<br />
Wide Web? And what will it be, in a month, a year, a decade?<br />
<br />
Once more, I donned my increasingly-unconvincing impersonation of Lou<br />
Burnard. This time, I was the Lou Burnard of the Arts and Humanities Data<br />
Archive feasibility study, in quest of masses of Arts and Humanities data which<br />
might need a secure, air-conditioned, (preferably clean and chocolate-coated)<br />
environment for the next few millennia. In the past months, WWW has become<br />
possibly the world's greatest data-generator, with seemingly the whole of academe<br />
and parts far beyond scrambling to put up home pages with links to everything. So<br />
my mission: what effect is the enormous success of WWW going to have on the<br />
domains of data archives in particular, and on library practice and electronic<br />
publishing in general?<br />
<br />
The conference was held at CERN, the European Centre for Nuclear Research. <br />
Underneath us, a 27km high-vacuum tunnel in which a positron can travel for three<br />
months at the speed of light without ever meeting another atomic particle. <br />
Ineluctably, the comparison presents itself: one can travel for weeks on WWW<br />
without encountering anything worth knowing. The quality of the Web was a<br />
major subtext of the conference. One thing the Web is not (yet) is any sort of<br />
organization: it is a mark-up language (HTML), a network protocol by which<br />
computers can talk to each other (HTTP) and it is a lot of people rushing in<br />
different directions. Anyone with a computer and an internet account can put<br />
anything on the Web, and they do. In no time at all, the Web has grown, and<br />
grown...<br />
<br />
Here follow questions I took with me, and some answers...<br />
<h4>
How big is the Web?</h4>
From the conference emerged a few remarkable figures on the growth of the Web:<br />
in June 1993 there were 130 Web servers. This had doubled by November 1993 to<br />
272, then doubled again within a month to 623, then doubled again by March this<br />
year to 1265 (source: David Eichmann conference paper 113, 116). Other speakers<br />
at the conference indicated that between March and June this year the number of<br />
servers has doubled and redoubled again, and is probably now somewhere over<br />
5000. Another speaker estimated that there were around 100,000 'artifacts'<br />
(articles, pictures etc etc...) on the Web (McBryan; Touvet put the figure at<br />
'billions'). The volume of information is matched by the level of use. The NCSA<br />
home page currently receives around one million requests a week. Around 500 gb<br />
a week are shifted around the Web, and WWW now ranks sixth in total internet<br />
traffic and will accelerate to first if it continues its current growth. <br />
<br />
<h4>
Who is the Web?</h4>
All this, from an idea in the head of Tim Berners-Lee, with no full-time staff or<br />
organization to speak of. I met (at lunch on Friday) Ari Luotonen, one of the two<br />
staff at CERN who have direct responsibility for running the CERN server, the<br />
first home of the Web, and still its European hub. Simultaneously, they are<br />
running the server, writing software etc. for it, and refining the file transfer<br />
protocol HTTP which underpins the whole enterprise. Before I went, I met Dave<br />
Raggett of Hewlett-Packard, Bristol, who is busy simultaneously inventing the next<br />
incarnation of HTML, HTML+, and writing a browser for it, and I spoke several more<br />
times to him during the conference. HTML, the Web's 'native' markup language, is<br />
to the software of the Web what HTTP is to the hardware: without these two,<br />
nothing.<br />
<br />
To one used to the rigours of the Text Encoding Initiative, which has<br />
ascetically divided design from practice in search of truly robust and elegant<br />
solutions, the spectacle of the base standards of the Web being vigorously<br />
redesigned by the same people who are writing its basic software is exhilarating<br />
(there is a lot of: that's a nice idea! I'll have that) and also rather scary. There is a<br />
glorious mix of idealism and hard-headed practicality about the people at the heart<br />
of the Web. This is nowhere better epitomized than in Tim Berners-Lee, whose<br />
address on the first morning had machines manipulating reality, with the Web being<br />
'a representation of human knowledge'. Along the way, Tim tossed out such<br />
treasures as the Web allowing us to play dungeons and dragons in the Library of<br />
Congress, or the Web finding the right second-hand car, or putting together a<br />
skiing party including Amanda. And, he said, you could buy a shirt on the Web.<br />
If the Web is to continue to grow and prosper, and become more than a<br />
computer hobbyist's super toy, it will need to be useful. Or will it? Can we only<br />
have all human knowledge on the Web if we can buy shirts on it, too? Useful, to<br />
whom?<br />
<br />
<h4>
Who uses the Web?</h4>
One paper (Pitkow and Recker) reported the results of a survey on who uses the<br />
Web, based on 4500 responses to a posting on the Web in January 1994: over 90<br />
per cent of its users are male, and 60 % are under 30, and 69% are in North<br />
American universities. One could say that this survey was rather skewed in that the<br />
survey relied on the 'forms' feature in Mosaic, and as few people had access to this<br />
on anything other than Unix, 93% of respondents (the same 94% who were male?)<br />
were Unix/Mosaic users. In any case, the conference was overwhelmingly male,<br />
and strongly American in accent. And the papers at the conference were<br />
overwhelmingly technical: concerned with bandwidth, Z39.50 protocols, proxy<br />
servers, caching, algorithms for 'walking the Web', and so on. In part this<br />
technical bias is necessary. The very success of the Web threatens Internet<br />
meltdown. But extreme technical skill in one area can mean frightening naivety in<br />
another, and vast quantities of naivety were on view over the three days. For<br />
many people, the Web is the world's greatest computer playground: hacker heaven<br />
in 150 countries at once. And so, many of the papers were exercises in ingenuity:<br />
look what I have done! without reflection as to whether it was worth doing in the<br />
first place. But, ingenuity certainly there was. Here are some examples...<br />
<br />
<h4>
How do we find out what is on the Web?</h4>
The sudden and huge size of the Web has created a monster: how do you find just<br />
what you want amid so much, where there are no rules as to what goes where, no<br />
catalogue or naming conventions? The whole of Thursday morning, and many<br />
other papers outside that time, were devoted to 'taming the Web': resource<br />
discovery methods for locating what you want on the Web. It is typical of the<br />
extraordinary fertility of invention which the Web provokes (and this just might be<br />
the Web's greatest gift to the world) that no two of these papers were even<br />
remotely alike in the methods they proposed. One system (SOLO, presented by<br />
Touvet of INRIA) focussed on directory services: it indexes all the file names<br />
accessed through the Web and constructs a 'white pages' for the Web which maps<br />
all the file names and their servers to resource names, so that one need only point at<br />
the SOLO directory to find an object, wherever it is. Again, something like this is<br />
badly needed as lack of 'name persistence' -- the distressing tendency for files to<br />
disappear from where they are supposed to be -- is a major Web shortcoming. <br />
Fielding (UC at Irvine) outlined a different solution to this particular problem: a<br />
'MOMspider' which would walk the Web and find places where files have been<br />
mislaid, and hence compute the Web's health and alert document owners to<br />
problems. Fielding argued that living infostructures must be actively maintained to<br />
prevent structural collapse: so, who is to do the maintenance? <br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>A quite different approach, though still centred on whole documents, is<br />
ALIWEB, presented by Martijn Koster of NEXOR Ltd. This uses index files<br />
prepared according to standard formats, containing information on who made the<br />
file, what it is, and some key-words, etc. These files must be 'hand-prepared' by<br />
someone at each server site; ALIWEB then collects all these files together and loads<br />
them into a database sent back to all the servers. This is modelled on the working<br />
of Archie, where servers similarly compile and exchange information about ftp<br />
resources, etc. There is no doubting the value of something like ALIWEB. But<br />
how many of the people running those 5000-plus servers (especially, the 4000 or so<br />
which have joined in the last three months) even know of ALIWEB? <br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Another system using 'hand-prepared' data based on whole-document<br />
description was outlined by McBryan (University of Colorado). This is GENVL,<br />
or 'the mother of all bulletin boards'. This uses a very clever system of nested<br />
'virtual libraries' deposited in GENVL by outside users. I can make up my own<br />
'virtual library' of Web resources in my own area of interest using the format<br />
supplied by GENVL. This library can then be nested within the existing GENVL,<br />
and point at yet other virtual libraries supplied by other people. People can find my<br />
library and the documents and libraries it points at by travelling down the<br />
hierarchy, or by WAIS searching on the whole collection. But, who guarantees the<br />
accuracy and timeliness of the data in GENVL? No-one, of course: but it seems on<br />
its way towards being the biggest yellow pages in the world. McBryan's<br />
demonstration of this was far the best demonstration seen at the conference: you can<br />
get into it for yourself at<br />
http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary_html. I must<br />
say the competition for the best demonstration was less intense than might have<br />
been supposed. Indeed, there were more presentational disasters than I have ever<br />
seen at any conference. Many presenters seemed so struck by the wonder of what<br />
they had done that they thought it was quite enough to present their Web session<br />
live, projected onto a screen, in happy ignorance of the fact that the ergonomics of<br />
computer graphics meant that hardly anyone in the auditorium could read what was<br />
on the screen. This enthusiastic amateurism is rather typical of a lot which goes on<br />
in the Web.<br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Other papers in this strand looked into the documents themselves. The<br />
Colorado system described by McBryan supplements the hand-prepared 'virtual<br />
libraries' by an automated 'Worm' (it appears the Web is breeding its own parasitic<br />
menagerie; in Web-speak, these are all varieties of 'robots') which burrows its way<br />
through every document on the Web, and builds a database containing index entries<br />
for every resource on the Web (as at March 1994, 110,000 resources, 110,000<br />
entries). It does not try to index every word; rather it indexes the titles to every<br />
resource, the caption to every hypertext link associated with that resource, and the<br />
titles of every referenced hypertext node. This is clever, and again was most<br />
impressive in demonstration: you simply dial up the worm with a forms-aware<br />
client (Mosaic 2.0, or similar) and type in what you are looking for. Each of<br />
McBryan's systems (GENVL and the Worm) currently receives over 2000 requests<br />
a day. They were far and away the most impressive, and most complete, of the<br />
search systems shown at the conference.<br />
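The Worm's economy, indexing only titles and link captions rather than full text, can be sketched in a few lines. This is a hypothetical toy, not McBryan's code; the sample resources, URLs and function names are all invented for illustration.<br />

```python
# Sketch of title-and-caption indexing, as the Worm does it: index only
# each resource's title and the captions of hypertext links pointing at it,
# never the full text of the document.

def build_index(resources):
    """Map each lower-cased word in titles and link captions to the URLs
    of the resources they describe."""
    index = {}
    for url, title, link_captions in resources:
        for text in [title] + link_captions:
            for word in text.lower().split():
                index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Return URLs whose title or link captions contain every query word."""
    results = None
    for word in query.lower().split():
        hits = index.get(word, set())
        results = hits if results is None else results & hits
    return results or set()

# Invented sample data standing in for a walk of the Web.
resources = [
    ("http://example.org/a", "Guide to Mosaic", ["download Mosaic 2.0"]),
    ("http://example.org/b", "Web robots and worms", ["robot exclusion"]),
]
index = build_index(resources)
```

Because the captions other authors attach to their links are indexed alongside a resource's own title, a document can be found through the words others use to describe it, at a fraction of the cost of full-text indexing.<br />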
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>We met many more of these 'robots' during the conference: beasts bred by<br />
computer science departments to graze their way through the Web. They are<br />
creatures of controversy: they can add considerably to the network load and there<br />
are some who would ban all robots from the Web. Eichmann of the University of<br />
Houston (Clear Lake) described his 'spider' which again walked the whole Web,<br />
like McBryan's worm. However, his system builds a full-text index of each<br />
Web document, not just an index of titles and hypertext captions. This is a<br />
modified WAIS index which works by throwing away the most frequently<br />
referenced items. However, the Web is so large that the index of even a part of it<br />
soon grew to over 100 megs. The sheer size of the Web, and the many near-<br />
identical documents in some parts of it, can have entertaining consequences. <br />
Eichmann related the sad tale of the time his spider fell down a gravitational well. <br />
It spent a day and a half browsing a huge collection of NASA documents (actually a<br />
thesaurus of NASA terms) and got virtually nothing back you could use in a search:<br />
all it found was 'NASA', ad infinitum, and so it threw the term away from the index<br />
and... eventually Eichmann had to rescue his spider. As NASA were sponsoring<br />
this research they were not amused to find that if you asked the spider to find all<br />
documents containing 'NASA' you got precisely nothing.<br />
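The space-saving trick described above, and the trap the spider fell into, can be illustrated with a toy inverted index that discards its commonest terms. This is a hypothetical sketch, not Eichmann's modified WAIS indexer; the documents and the pruning threshold are invented.<br />

```python
from collections import Counter

def build_pruned_index(docs, drop_top_n):
    """Inverted index over docs ({doc_id: text}) that throws away the
    drop_top_n most frequently occurring terms to keep the index small."""
    index = {}
    frequency = Counter()
    for doc_id, text in docs.items():
        # count each word once per document, preserving first-seen order
        for word in dict.fromkeys(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
            frequency[word] += 1
    # the 'NASA' trap lives here: the commonest terms simply vanish
    for word, _ in frequency.most_common(drop_top_n):
        del index[word]
    return index

# A toy stand-in for the NASA thesaurus: every entry contains the same terms.
docs = {
    "t1": "NASA thesaurus entry aerodynamics",
    "t2": "NASA thesaurus entry propulsion",
    "t3": "NASA thesaurus entry telemetry",
}
index = build_pruned_index(docs, drop_top_n=3)
```

Since 'nasa', 'thesaurus' and 'entry' occur in every document, dropping the three commonest terms removes them all, and a search for 'nasa' across this collection now finds precisely nothing.<br />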
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Every conference on computing must contain a paper with the title 'lost in<br />
hyperspace': this one was from Neuss (Fraunhofer Institute, Germany). I was by<br />
now rather fatigued by techies who thought that running GREP across the Web<br />
actually was a decent search strategy. Neuss did a little more than this (using 'fuzzy<br />
matching') but what he had was essentially our old friend the inverted file index. <br />
Which of course when faced with gigabytes finds it has rather too many files to<br />
invert and so tends to invert itself, turtle-like, which is why both Neuss and<br />
Eichmann found themselves concluding that their methods would really only be<br />
useful on local domains of related resources, etc. At last, after all this we had a<br />
speaker who had noticed that there is a discipline called 'information retrieval',<br />
which has spent many decades developing methods of relevance ranking, document<br />
clustering and vectoring, relevance feedback, 'documents like...', etc. He pointed<br />
out the most glaring weaknesses in archie, veronica, WAIS and the rest (which had<br />
all been rather well demonstrated by previous speakers). However, it appeared he<br />
had good ideas but no working software (the reverse of some of the earlier<br />
speakers).<br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Finally, in this strand, the true joker of the pack: De Bra of Eindhoven<br />
University with -- hold your hat -- real-time full-text searching of the web. That<br />
is: you type in a search request and his system actually fires it off simultaneously to<br />
servers all over the Web, getting them to zoom through their documents looking<br />
for matches. It is not quite as crude as this, and uses some very clever 'fishing'<br />
algorithms to control just how far the searches go, and also relies heavily on<br />
'caching' to reduce network load. But it is exactly this type of program which<br />
rouses the ire of the 'no-robots' Web faction (notably, Martijn Koster).<br />
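The depth-limiting idea behind such 'fishing' algorithms can be sketched roughly: a query swims outward along links, but only to a fixed distance, so the search dies out instead of walking the whole Web. This is a hypothetical simplification of De Bra's approach (which also ranks and prunes branches); the link graph below is an in-memory stand-in for real servers and documents.<br />

```python
def fish_search(start, query, links, texts, depth=2):
    """Propagate a search from `start` across links, visiting documents
    only up to a fixed depth, and return the URLs whose text matches."""
    found, seen = set(), {start}
    frontier = [start]
    for _ in range(depth + 1):      # level 0 is the start document itself
        next_frontier = []
        for url in frontier:
            if query in texts.get(url, ""):
                found.add(url)      # a 'catch': this document matches
            for neighbour in links.get(url, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier    # swim one link further out
    return found

# A toy link graph: a chain of documents a -> b -> c -> d.
links = {"a": ["b"], "b": ["c"], "c": ["d"]}
texts = {"a": "index page", "b": "web papers",
         "c": "web robots", "d": "web caches"}
results = fish_search("a", "web", links, texts, depth=2)
```

With depth 2, the search reaches b and c but never fetches d; raising the depth widens the catch at the cost of more network load, which is exactly the trade-off that alarms the 'no-robots' faction.<br />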
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>So: masses of ways of searching the Web. But is there anything there worth<br />
finding?<br />
<br />
<h4>
How serious is the Web? -- Electronic publishing and the Web</h4>
There were times in these three days when I felt I had strayed into some vast<br />
electronic sandpit. I had come, on behalf of the Arts and Humanities archives, in<br />
search of masses of data which might need archiving. What I found instead was a<br />
preoccupation with how the Web worked, and very little interest in the quality of<br />
material on it. On the Wednesday morning, Steffen Meschkat (ART + COM,<br />
Berlin) described his vision of active articles in interactive journals. These would<br />
present a continuous, active model of publication. The boundaries between<br />
reviewed and published articles would disappear, and the articles themselves would<br />
merge with the sea of information. To Meschkat, the idea of publication as a<br />
defining act has no meaning; there is no place in his world for refereed publication,<br />
so important to University tenure committees. There is no doubt that the Web is a<br />
superb place for putting up pre-publication drafts of material for general comment,<br />
and many scholars are making excellent use of it for just that. But to argue that this<br />
is a completely satisfactory model of publication is absurd. A journal whose only<br />
defining strategy is the whims of its contributors (how do you 'edit' such a<br />
journal?) is not a journal at all; it is a ragbag. I am pleased to report that the<br />
centrepiece of Meschkat's talk, a demonstration of the active journal, failed to<br />
work. <br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>In general, I was surprised by the level of ignorance shown about what the<br />
Web would need if it were to become a tool for serious electronic publishing of<br />
serious materials. It is the first conference I have been to where the dread word<br />
copyright hardly appeared. Further, there was hardly any discussion of what the<br />
Web would need to become a satisfactory electronic publishing medium: that it<br />
would have to tackle problems of access control, to provide multilayered and fine-<br />
grained means of payment, to give more control over document presentation. A<br />
SGML group met on Wednesday afternoon to discuss some of these issues, but did<br />
not get very far. There were representatives of major publishers at this group<br />
(Springer, Elsevier) but none is yet doing more than dipping a toe in the Web.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>From the talk at the conference, from the demonstrations shown, and what I<br />
have seen of the Web, it is not a publication system. It is a pre-publication system,<br />
or an information distribution system. But it is as far from a publishing system as<br />
the Thursday afternoon free paper is from OUP.<br />
<br />
<h4>
How serious is the Web? -- free information, local information</h4>
The free and uncontrolled nature of the Web sufficiently defines much of what is<br />
on it. Most of what is on the Web is there because people want to give their<br />
information away. Because it is free, it does not matter too much that the current<br />
overloading of the internet means that people may not actually be able to get the<br />
information from the remote server where it is posted. Nor does it matter that the<br />
right files have disappeared, or that the information is sloppily presented, eccentric<br />
or inaccurate. These things would be fatal to a real electronic publishing system,<br />
but do not matter in the give-away information world which is much of the Web.<br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>There is an important category of Web use which does not fit this<br />
description. That is its use for distribution of local information systems: for<br />
carrying the local 'help' or 'info' system for a university, or a company. The most<br />
impressive instances of Web use offered at the conference fitted this model. Here,<br />
we are dealing with information which is valuable but free: valuable, because it is<br />
useful to a definable group of people. Typically, the information is owned by the<br />
local organization and so there are no copyright problems to prevent its free<br />
availability. Further, because the information is most valuable to the local user,<br />
and rather less valuable to anyone beyond, it is quite valid to optimize the local<br />
server and network for local access; those beyond will just have to take their<br />
chances. The Web server currently being set up in Oxford is an excellent example<br />
of the Web's power to draw together services previously offered separately<br />
(gopher, info, etc.) into a single attractive parcel. Several conference papers told<br />
of similar successful applications of the Web to local domains: a 'virtual classroom'<br />
to teach a group of 14 students dispersed across the US, but linked to a single Web<br />
server (Dimitroyannis, Nikhef/Fom, Amsterdam); an on-line company-wide<br />
information system by Digital Equipment (Jones, DEC). Most impressive was the<br />
PHOENIX project at the University of Chicago (Lavenant and Kruper). This aims<br />
to develop, on the back of the Web, a full teaching environment, with both students<br />
and teachers using the Web to communicate courses, set exercises, write papers etc. <br />
Three pilot courses have already been run, with such success that a further 100<br />
courses are set to start later this year. Because it is being used for formal teaching,<br />
PHOENIX must include sophisticated user-authentication techniques. This is done<br />
through the local server; a similar method (though rather less powerful) is in use at<br />
the City University, London (Whitcroft and Wilkinson).<br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Such local applications resolve the problem of access to remote servers<br />
which threatens to strangle the Web (and with it, the whole Internet). They also<br />
solve the problems of accountability and access: someone is responsible for the<br />
information being accurate and available. As for access: if restriction is required,<br />
this can be done through the local server, so avoiding all the problems of trying to<br />
enforce control through servers on the other side of the world. The Web is so well<br />
suited to these local information services that these alone will see the Web continue<br />
to burgeon. Much of the Web will be a network of islands of such local systems,<br />
upon which the rest of the world may eavesdrop as it will. Among these islands,<br />
we can expect to see many more distance learning providers, on the model of<br />
PHOENIX but distributed across world-wide 'virtual campuses'. Butts et al.<br />
described the 'Globewide Network Academy' (GNA), which claims to be the<br />
world's first 'virtual organization', and appears to intend to become a global<br />
university on the network. This seems rather inchoate as yet. But already 2500<br />
people a day access GNA's 'Web Tree'; where they are going, others will certainly<br />
follow. We could also expect to see many more museums, archives, libraries, etc.<br />
using the Web as an electronic foyer to their collections. There was only one such<br />
group presenting at the conference, from the Washington Holocaust museum<br />
(Levine): that there should be only this one tells us a great deal about the lack of<br />
computer expertise in museums. The Web is an ideal showplace for these. Expect<br />
to see many more museums, etc., on the Web as the technology comes within their<br />
reach.<br />
<br />
<h4>
How can I make sure I can get through to the Web?</h4>
For days on end, I have tried to get through to McBryan's server at Colorado, or<br />
the NCSA home page, or indeed anywhere outside the UK. Unless you are<br />
prepared to time it for a moment when all the rest of the world is in bed (and you<br />
should be too), you cannot get through. This alone could spell doom for the Web's<br />
aspirations. Certainly, it could not be a commercial operation on this basis:<br />
no-one is going to pay to be told that the server they want is unavailable. Of<br />
course, this does not matter if all you want is local information, as described in the<br />
last section. But this is far from the vision of the world talking to itself promoted<br />
so enthusiastically by Berners-Lee and others.<br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Help is at hand. A surprising number of papers were devoted to problems<br />
of making the Web work. The answer is simple: if lack of network bandwidth<br />
means we cannot all dial into America at once, set up clones of America all over the<br />
world. There are many ways of doing this: 'mirror' servers, etc, as long known in<br />
the FTP world, for example. There are obvious problems with mirror servers in<br />
the fast-moving world of the Web, with some documents both changing and being<br />
accessed very frequently. The method that seems most enthusiastically pursued is<br />
'caching', otherwise 'proxy' servers. Instead of dialling America direct, you dial<br />
the proxy server. The server checks if it has a copy of the document you want; if it<br />
has, then it gives you the document itself. If it has not, it gets the document from<br />
America, passes it to you and keeps a copy itself so the next person who asks for it<br />
can get it from the cache. Papers by Luotonen, Glassman, Katz and Smith<br />
described various proxy schemes, several of them based on Rainer Post's Lagoon<br />
algorithms. It is the sort of topic which fascinates computer scientists, with lots of<br />
juicy problems to be solved. For example: once someone has dialled into a proxy<br />
and then starts firing off http calls from the documents passed over by the proxy,<br />
how do we make sure all those http calls are fed to the proxy and not to the 'home'<br />
servers whose address is hard-wired into the documents? or, how do you operate a<br />
proxy within a high-security 'firewall' system, without compromising the security? <br />
how do you make sure the copy the proxy gives you is up to date? and so on.<br />
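The basic mechanism, stripped of the juicy problems just listed (freshness checks, firewalls, redirecting embedded http calls), can be sketched in a few lines. This is a hypothetical illustration, not any of the proxy schemes presented at the conference; fetch_from_origin stands in for a real network request, and the sample URL and document are invented.<br />

```python
class CachingProxy:
    """Serve documents from a local cache, fetching from the distant
    origin server only on a miss and keeping a copy for later requests."""

    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin  # stands in for an HTTP call
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.cache:
            self.hits += 1                 # answered locally: no long-haul call
            return self.cache[url]
        self.misses += 1
        document = self.fetch_from_origin(url)
        self.cache[url] = document         # keep a copy for the next requester
        return document

# A simulated origin server on the far side of the Atlantic.
origin = {"http://info.cern.ch/": "<h1>World-Wide Web</h1>"}
proxy = CachingProxy(lambda url: origin[url])

first = proxy.get("http://info.cern.ch/")   # fetched from the origin
second = proxy.get("http://info.cern.ch/")  # served from the local cache
```

Only the first requester pays the transatlantic cost; everyone after gets the cached copy, which is precisely why a popular proxy like HENSA's can make the Web usable from Britain at peak hours.<br />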
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>These systems work, and work well: since I have caught on to dialling the<br />
Web via the HENSA proxy at Kent, life has suddenly become possible. An<br />
interesting statistic, drawn from Smith of Kent's talk: over an average week, it<br />
seems that around two thirds of the calls to the Web by all those hundreds of<br />
thousands of people are to some 24,000 documents (about a quarter of the total on<br />
the Web, as of May), amounting to just 450 Meg of data. The Web is both<br />
enormous, and surprisingly small. One point is clear: proxies and similar devices<br />
can only work given sophisticated co-operation between servers.<br />
<br />
<h4>
Has the Web a future?</h4>
If the Web stays as it is, it will slowly strangle itself with too much data, too little<br />
of which is of any value, and with even less of what you want available when you<br />
want it. I had more than a few moments of doubt at the conference: there were too<br />
many of the hairy-chested 'I wrote a Web server for my Sinclair z99 in turbo-<br />
Gaelic in three days flat' types to make me overwhelmingly confident. The Web will not<br />
stay as it is. The Web has grown on the enthusiastic energies of thousands of<br />
individuals, working in isolation. But the next stages of its growth, as the Web<br />
seeks to become more reliable and to grow commercial arms and legs, will require<br />
more than a protocol, a mark-up language and enthusiasm. Servers must talk to<br />
one another, so that sophisticated proxy systems can work properly. Servers will<br />
have to make legal contracts with one another, so that money can pass around the<br />
Web. Richer mark-up languages, as HTML evolves, will require more powerful<br />
browsers, and create problems of standardization, documentation and maintenance. <br />
All this will require co-operation, and organization. The organization is on the<br />
way. After much hesitation, CERN has decided to continue to support the Web,<br />
and the EC is going to fund a stable European structure for the Web, probably<br />
based at CERN and possibly with units in other European centres (including,<br />
perhaps, Oxford). Berners-Lee himself is going to MIT, where he is going to work<br />
with the team which put together the X Window System.<br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>What will the Web look like, then, in a few years? The momentum for<br />
much of the Web, at least, to remain free and uncontrolled (even, anarchic) will<br />
continue: it will become the world's free Thursday afternoon paper, a rag-bag of<br />
advertising, vanity publishing, plain nonsense and serious information. There will<br />
be many local islands of information -- help systems, archive galleries, educational<br />
material. Eccentrics, like the Norwegian family who have put their whole house on<br />
the Web so you can dial up a video camera in their living room and watch what<br />
they are doing, will continue to find a home on the Web (Ludvigsen). But<br />
increasingly the Web will provide a gateway to other, specialist and typically<br />
commercial services: electronic publishing, databases, home shopping, etc. The<br />
presence of these satellite services will relieve the Web itself from the need to be all<br />
things to all people, and allow it to continue to be the lingua franca to the<br />
electronic world -- demotic, even debased, but easy, available and above all, free. <br />
The Web will not be the only gateway to services which can do what the Web<br />
cannot, but it may well be the biggest and the most widely used. For organizations<br />
like Universities, libraries, archives of all sorts (and, especially, the AHDA), which<br />
give some information away and sell other information, the Web and its satellites<br />
will be an ideal environment. This aspect of the Web has barely been touched as<br />
yet. But it will happen, and it is our business to make sure it happens the right<br />
way.<br />
<br />
<h4>
What have I not described?</h4>
Even this rather long account omits much. Briefly, a few things I have not<br />
covered:<br />
1. HTML tools: there was discussion about HTML and its future (Raggett; various<br />
workshops); several HTML editors were described (Williams and Wilkinson;<br />
Kruper and Lavenant; Rubinsky of SoftQuad announced HoTMetaL); various<br />
schemes for converting documents to HTML were outlined (from FrameMaker:<br />
Stephenson; Rousseau; from LaTeX: Drakos).<br />
2. Connectivity tools for linking the Web to other resources: to CD-ROMs<br />
(Mascha); to full-featured hypertext link servers (Hall and the Microcosm team); to<br />
computer program teaching systems (Ibrahim); to databases (Eichmann)<br />
and much, much more.<br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Thirty years after McLuhan, the global village is here.<br />
<div>
<br /></div>