Sunday 17 September 2023

Setting up the revised Collation Editor: file structures

 In an earlier post, I explained some of the history behind the Collation Editor, and our use of it in Textual Communities.  At last, I am updating the Collation Editor embedded into TC!

The Collation Editor has two major dependencies:

  1. On Python, for a series of critical tasks run through a Python server;
  2. On CollateX, for the actual collation.

The first task was to create a version of the Collation Editor Core implementing both dependencies. I did this by mirroring the structure of the stand-alone collation editor code (available at https://github.com/itsee-birmingham/standalone_collation_editor). Thus, this is what the top-level folder looks like in my implementation (in my installation, in /Applications/Collation_Editor_Core):

That is: at the root level I have a folder holding collateX, with the collatex-tools jar in it. There is a folder labelled "collation" which we will look at in a moment. There are two python files, and then a .sh and .bat file which start up the application (this structure is taken from the current stand-alone collation editor structure).

Within the "collation" folder, here is what I have:

And then, going still deeper, this is the content of the "core" folder:
You see here a series of .py files, all needed for the link to Python to work. However, we need to have an index.html file in place to run the instance. The index.html file is actually contained within the "collation/static" folder, as follows:
Here is what the index.html file has, in this starter configuration:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta http-equiv="X-UA-Compatible" content="IE=8" />
  <title>Collation Editor</title>
  <meta name="description" content="Collation and Apparatus Editor" />
  <meta http-equiv="X-UA-Compatible" content="IE=edge" />
  <script>
    var SITE_DOMAIN = "http://localhost:8080";
    var staticUrl = SITE_DOMAIN + '/collation/';
  </script>
  <script type="text/javascript" src="/collation/js/jquery-3.3.1.min.js"></script>
  <script type="text/javascript" src="/collation/js/jquery-ui.min.js"></script>
  <link rel="stylesheet" href="/collation/pure-release-1.0.0/pure-min.css" type="text/css" />
  <script type="text/javascript" src="/collation/CE_core/js/collation_editor.js"></script>
  <script type="text/javascript">
    var servicesFile = 'js/local_services.js';
    collation_editor.init();
  </script>
</head>
<body oncontextmenu="return false;">
<div id="header" class="collation_header">
<h1 id="stage_id">Collation</h1>
<h1 id="project_name"></h1>
<div id="login_status"></div>
</div>
<div id="container">
<p>Loading, Please wait.</p>
<br/>
<br/>
</div>
  <div id="footer"></div>
<div id="tool_tip" class="tooltip"></div>
</body>
</html>

Note that the "src" and "href" attributes point to "/collation..." and not to "collation...". The leading "/" is important, as it sends the server to look for these files in the root "collation" folder.
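To illustrate why the leading "/" matters, here is a minimal sketch using the standard URL class found in browsers and Node. It assumes the page is served at "http://localhost:8080/collation/" (the address used later in this post); the helper name "resolve" is mine, purely for illustration.

```javascript
// Address of the page that loads the scripts (an assumption for this sketch).
const pageUrl = "http://localhost:8080/collation/";

// Resolve a src/href value against the page address, as a browser would.
function resolve(ref) {
  return new URL(ref, pageUrl).href;
}

const withSlash = resolve("/collation/js/jquery-3.3.1.min.js");
const withoutSlash = resolve("collation/js/jquery-3.3.1.min.js");

console.log(withSlash);    // http://localhost:8080/collation/js/jquery-3.3.1.min.js
console.log(withoutSlash); // http://localhost:8080/collation/collation/js/jquery-3.3.1.min.js
```

Without the leading "/", the reference is resolved relative to the page's own folder, so "collation" appears twice in the resulting address.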

With this structure we can start up an instance of the Collation Editor with both Python and CollateX in place by going to the terminal, moving into the root directory thus:
cd /Applications/Collation_Editor_Core

And then starting up the instance with

./startup.sh

This calls Python 3 to start a server at localhost:8080, with the "collation" folder as the root, and running the Python .py files in the "collation/core" folder. It also starts up CollateX, from the "collatex" folder at the root, with CollateX running on another port. If all is in place, this is what you will see when you go to "http://localhost:8080/collation/" in your browser.

If you have the "data" folder from the stand-alone installation in the "collation" folder, you can type in "B04K6V23" into the "Select" box and then hit the "Collate Project Witnesses" button (currently not working ...)

 

Setting up the revised Collation Editor: some history (2023)

 I am a huge fan of the "Collation Editor", built by Cat Smith of the Institute for Textual Scholarship and Electronic Editing (ITSEE) at the University of Birmingham, with substantial input from Troy Griffitts, now at The Göttingen Academy of Sciences and Humanities in Lower Saxony. Some history is required. The roots of the Collation Editor lie in my Collate software, written for the Macintosh computer from 1989 on and, in its day, used heavily by multiple editing projects. Notable among these user projects were two groups editing Biblical texts: those associated with the Institute for New Testament research at Münster, Germany (INTF), and David Parker and scholars working with him at the University of Birmingham (now, ITSEE). 

Part of the story of how Collate begat CollateX, and CollateX begat the Collation Editor, is told in other blogs on this site: https://scholarlydigitaleditions.blogspot.com/2014/09/the-history-of-collate.html and https://scholarlydigitaleditions.blogspot.com/2014/09/collate-2-and-design-for-its-successor.html. These blogs, though here dated 2014, were written in 2007. Other parts can be deduced from an article about the evolution of digital methods in the INTF and ITSEE written by myself, David Parker, Hugh Houghton and Klaus Wachtel (you can read that article at my Academia site, or via its DOI). 

The first part of this begetting is the making of CollateX. CollateX fulfilled completely the first part of the agenda I laid out in the blogs on this site: to create a system for comparison of multiple texts which was modular and independent of any one hardware or software implementation. CollateX is a marvel, and a remarkable achievement by the team of software engineers who made it (prominently, Ronald Dekker of the Huygens Institute, Amsterdam). 

The second part of this begetting was the making of the Collation Editor. This creates an entire environment permitting editors to create exactly the collation they want, by determining through a point-and-click interface exactly what words collate with what and how the collation is to be expressed. Essentially, the Collation Editor is an interface to, and an extension of, CollateX: permitting editors to adjust the CollateX collations to create exactly the collations they want. For me, the test of the Collation Editor, and its implementation of CollateX, was simple: could we achieve exactly the same complex collations with the Collation Editor/CollateX as we could, from 1995 to around 2015, with Collate? The answer is, triumphantly, yes. Indeed, we could achieve far more with the Collation Editor than we ever could with Collate. Here is the tool I dreamed of in 2007. (Somewhere, I said that it would take a team of ten people ten years to make the replacement for Collate. I was not far wrong).

Accordingly, in 2016 I started work on integrating the Collation Editor into Textual Communities. We have now used this integrated implementation to collate some four thousand lines of the Canterbury Tales, in preparation of our forthcoming Critical Edition of the Tales. You can see how this works in a video I made, collating just one line of the Tales. As you can see, the Collation Editor can create exactly the highly-complex collations we want. In the last years, it has become an absolutely vital part of our work on the Tales. However, the version we integrated in 2016, and which is still the version we are using, is now seriously outdated. Many improvements have been made to the Collation Editor since 2016 (or, in effect, 2019, when we last updated our implementation of the Collation Editor) and finally, thanks to a sabbatical, I am setting out to bring the Textual Communities version of the Collation Editor up to date. This task should be greatly eased by the re-organization and rewriting of the Collation Editor since 2019. The Collation Editor code has now been cleanly divided into a "core" code library, designed so that the whole core can fit inside any implementation and be easily updated, and a "services" code library, which connects the core to whatever implementation you want. In our case, we use MongoDB document databases to store all our information about our texts, and hence everything the Collation Editor needs to function should be linked to our MongoDB databases.

In the next posts, I will explain how I went about setting up the updated core collation tools of the Collation Editor to work within Textual Communities, in the same way as a series of blogs on staticSearch explain how I got this to work with our data.



Sunday 6 August 2023

In praise of staticSearch

Over the last few weeks, I have worked intensively with staticSearch to integrate it into our forthcoming publication of Edvige Agostinelli and Bill Coleman's digital edition of  Boccaccio's Teseida. You can see the near-complete prototype at http://inklesseditions.com/TeseidaStatic/. Note that this is still a prototype only, and the address will change as we move to full publication; please DO NOT repost this link on the open internet. Among other matters, this is "Endings"-conformant in its principal components: see https://scholarlydigitaleditions.blogspot.com/2023/07/the-endings-project-and-canterbury.html

As you can see (try searching for "come" by typing it in the search box and hitting return, or the search icon) staticSearch works beautifully here. And hence my final word on staticSearch. This is a quite wonderful tool. It is lightning fast, easy to set up, and works like a dream. As a true Endings tool, it has no dependencies on any outside systems of any kind. Take a bow, Martin Holmes and Joey Takeda (and everyone else who has contributed). Great work.




Sunday 30 July 2023

Setting up staticSearch for our projects: nested files and multiple search entry points

I now realize (a week later!), after looking at the staticSearch projects listed in the documentation, two things I did not know before: two respects in which our projects differ (it seems) from all other staticSearch implementations to date:

  1. staticSearch assumes (or at least, all the listed projects appear to follow this model) that all the pages to be searched are held in the same folder as the root index.html file. Indeed, a 2019 presentation by the staticSearch team explicitly declares that "All pages live together in the same folder" and, furthermore, "We don't care" if that means there are 10,591 files in that one folder.
  2. staticSearch assumes (or at least, all the listed projects appear to follow this model) that all searches are launched from a single place, and a single file, contained in that same folder holding all the project files.
Neither of these assumptions holds good for our projects. I anticipate that the Canterbury Tales Project when complete (!) will require somewhere around 90,000 distinct html files: one for each of the 29,000 manuscript pages in which the Tales occur; three files for each of the some 20,000 entities (lines of poetry, blocks of prose) which constitute the text of the Tales. I, for one, am not comfortable with around 90,000 files in a single folder. We devised a uniform directory structure to hold all these files. The transcript of folio 1r in Hengwrt is held in "html/transcripts/Hg/1r.html"; the collation of the first line of the General Prologue is held in "html/collations/GP/1.html". By design, then, all our html files are buried four layers below the "home" folder holding our index.html file.
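As a small sketch of that directory scheme and of the file-count arithmetic: the function names below are hypothetical, invented for illustration, and the numbers are simply the estimates given above.

```javascript
// Path of a transcript page, e.g. folio 1r of Hengwrt (Hg).
function transcriptPath(manuscript, folio) {
  return `html/transcripts/${manuscript}/${folio}.html`;
}

// Path of a collation page, e.g. line 1 of the General Prologue (GP).
function collationPath(section, line) {
  return `html/collations/${section}/${line}.html`;
}

// Rough file count from the estimates in the text.
const manuscriptPages = 29000;
const entities = 20000;
const filesPerEntity = 3;
const totalFiles = manuscriptPages + entities * filesPerEntity;

console.log(transcriptPath("Hg", "1r")); // html/transcripts/Hg/1r.html
console.log(collationPath("GP", 1));     // html/collations/GP/1.html
console.log(totalFiles);                 // 89000
```

The total of 89,000 is where the "around 90,000" figure comes from.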

In fact, we discovered that the '<recurse>true</recurse>' statement in the configuration files means that staticSearch has no problem at all with nested directories. It duly finds and indexes all our html pages. But the second issue -- that the default staticSearch configuration expects that all searches will be run from a single file, located in the project home directory -- does cause problems. We could, quite easily, have set up our projects the same way as staticSearch expects, so that clicking on a "search" icon or similar on each of the 90,000 pages would send the reader to a single search page, presumably in the home directory. But we did not want to do that. Here is how the header for one of our project pages looks (for folio 72r of the Naples manuscript of the forthcoming Agostinelli/Coleman edition of Boccaccio's Teseida):

A fundamental principle of our edition design is "have only the pages you really need". We want our readers to be able to run the search directly from the page they are looking at, and not have to go to any other page to do the search. Further, we want the header on all our pages to look the same, following another mantra: "keep everything as uniform as possible across the whole edition". This meant that every one of our (possibly) 90,000 html pages would have a search box on it, as you can see in the top right of this image. This means too that searches would not always begin from a file located at the root of the project folder. Indeed, all searches except those run from the index.html starting point to our editions would begin from a file nested four layers deep in the project folder. And that is why we found the problems with folder paths referred to in the previous post.

I will post a suggestion in the staticSearch issues forum as to how staticSearch itself could help projects configured like ours, with many files spread over multiple folders and each file being a search access point. In a final post in this series, I offer some general thoughts about staticSearch.

Monday 24 July 2023

Setting up staticSearch for our project: the search results

 In the last post, we integrated the search page into the header of our pages. Now, we need to deal with the search results.  staticSearch by default places the search results in the <div id="ssResults"> element. But as part of our set up, we hid that element. Instead, we want the search results to appear in a different place on the page: in a <div id="searchContainer">. So how do we do this?

staticSearch has anticipated that users might want to intervene at the end of the search process to adjust how and where the search results appear. You can find this out by digging into the ssSearch-debug.js file which staticSearch helpfully makes available (it is in the staticSearch folder which Ant makes in your project folder). In it you will find references to a "searchFinishedHook" function, created explicitly as a hook where developers such as me can get at the results of the search and manipulate them before they are seen by the user. The definition of searchFinishedHook is left open in the staticSearch initialization:           

       this.searchFinishedHook = function(num){};

 Accordingly we can redefine searchFinishedHook to let us do what we want to the search results. In this piece of code in the ssInitialize.js file in the staticSearch folder, we define our own searchFinishedHook function:

     window.addEventListener('load', function () {
          Sch = new StaticSearch();
          // Override the empty default hook with our own handling of the results
          Sch.searchFinishedHook = function (num) {
               $("#splash").hide();
               $("#rTable").hide();
               $("#searchContainer").html($("#ssResults").html());
               $("#searchContainer").show();
          };
     });

The first two lines hide the "splash" and "rTable" elements on the page. The next line copies all the search results from the hidden "ssResults" element into the "searchContainer" element, and the last line shows that element. Here is what it looks like for a simple search:
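For anyone not using jQuery, the same steps can be sketched with plain DOM calls. This is only an illustration under my own assumptions: the function name "makeSearchFinishedHook" is hypothetical, the document object is passed in as a parameter so that the function can be tried outside a browser, and "block" is assumed to be a suitable display value for the results container.

```javascript
// Build a hook that hides the splash and rTable elements and copies the
// search results out of the hidden ssResults element into searchContainer.
// `doc` is the page's document object (or a stand-in with getElementById).
function makeSearchFinishedHook(doc) {
  return function (num) {
    const hide = (id) => {
      const el = doc.getElementById(id);
      if (el) el.style.display = "none";
    };
    hide("splash");
    hide("rTable");

    const results = doc.getElementById("ssResults");
    const target = doc.getElementById("searchContainer");
    if (results && target) {
      target.innerHTML = results.innerHTML; // copy the results markup
      target.style.display = "block";       // assumed display value
    }
  };
}

// In the browser, inside the 'load' listener, after Sch = new StaticSearch():
//   Sch.searchFinishedHook = makeSearchFinishedHook(document);
```

The checks for missing elements mean the hook does nothing harmful on a page that lacks a "splash" or "rTable" element.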


For a first, straight out-of-the-box effort, this is really impressive! And also lightning fast.  You can see too that staticSearch has set us up with links to each page. In our project, we have the base index.html at the root of our project folder. All the files with transcripts of each page are in a folder labelled "html", with a subfolder "transcripts", a subfolder for each manuscript, and finally the html for each page. Thus the transcript for folio 1r of manuscript AUT is in html/transcripts/AUT/1r.html.

staticSearch knows that each page it indexes is in the folder "html/transcripts/..." relative to the index.html file. Accordingly, staticSearch creates a link to each page as (for example) href="html/transcripts/AUT/1r.html".  The links to files work fine for searches from our index.html file. Following standard web protocols, the path "html/transcripts/AUT/1r.html" is appended to the path to the index.html file (e.g. "https://www.inklesseditions.com/teseida/"), thus becoming "https://www.inklesseditions.com/teseida/html/transcripts/AUT/1r.html".

However, we want to do searches not just from the root index.html file, but also from every transcript file. That is: the calling file for each search is NOT at the root directory, where the index.html file is, but in a directory nested several layers deep within the root. For example, a search from the transcript file for page 1r of the AUT manuscript will be sent from html/transcripts/AUT/1r.html. This affects all the file calls needed and created for staticSearch. Accordingly, we have to make multiple adjustments within the file 1r.html to prepare it for static search:

        The calls to ssInitialize.js and ssSearch.js have to go to ../../../staticSearch/ssInitialize.js and
          ../../../staticSearch/ssSearch.js (the same for ssSearch-debug.js if you are using that instead)

         The attribute value @data-ssfolder on the form id="ssForm" which runs the search has to be set 
          to ../../../staticSearch and not just staticSearch
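These adjustments follow mechanically from how deep the page sits below the project root. A minimal sketch, assuming each page knows its own path relative to the root; the function names "prefixToRoot" and "staticSearchPaths" are hypothetical, made up for this illustration.

```javascript
// Work out the "../" prefix leading from a page back to the project root.
// pagePath is relative to the root, e.g. "html/transcripts/AUT/1r.html".
function prefixToRoot(pagePath) {
  const depth = pagePath.split("/").length - 1; // folders above the file
  return "../".repeat(depth);
}

// The three values that need adjusting on a nested page.
function staticSearchPaths(pagePath) {
  const prefix = prefixToRoot(pagePath);
  return {
    ssFolder: prefix + "staticSearch",
    initScript: prefix + "staticSearch/ssInitialize.js",
    searchScript: prefix + "staticSearch/ssSearch.js",
  };
}

const nested = staticSearchPaths("html/transcripts/AUT/1r.html");
console.log(nested.ssFolder);     // ../../../staticSearch
console.log(nested.searchScript); // ../../../staticSearch/ssSearch.js
console.log(staticSearchPaths("index.html").ssFolder); // staticSearch
```

For the root index.html the prefix is empty, so the default values are unchanged.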

With these adjustments, the search from 1r.html runs just fine. But we have a problem with the links to the pages in the search results. It is this: because staticSearch does not understand that the 1r.html file is buried in html/transcripts/AUT/ it fails to create valid links in the search results to the files containing the search hits. For example: the link from html/transcripts/AUT/1r.html to 1r.html of the NO manuscript should be html/transcripts/NO/1r.html.  Instead, StaticSearch links to         
           html/transcripts/AUT/html/transcripts/NO/1r.html

That is: it concatenates the file path for html/transcripts/NO/1r.html with the path for html/transcripts/AUT/. The path should actually be '../../../html/transcripts/NO/1r.html'. 

It is not too difficult to fix these incorrect paths by calling a function to rewrite these internal links in our override of the searchFinishedHook  function. But this is somewhat ugly. A better solution would be to have staticSearch recognize where it is functioning from in the file-system and adjust links accordingly in the Ant process. In the next post, I explore how these problems have arisen and what might be done about it.
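Here is a sketch of what such a link-rewriting function might look like. It is not the code we actually use: the names "prefixFromSsFolder" and "fixResultHref" are invented for illustration, and it assumes the "../" prefix can be derived from the data-ssfolder value already set on the page. The URL class is used only to check the result against standard resolution rules.

```javascript
// Derive the "../" prefix from the data-ssfolder value used on the page,
// e.g. "../../../staticSearch" gives "../../../".
function prefixFromSsFolder(ssFolder) {
  return ssFolder.replace(/staticSearch\/?$/, "");
}

// Prepend the prefix to a root-relative result link. Absolute URLs,
// root-relative paths, fragments and already-prefixed links are left alone.
function fixResultHref(href, prefix) {
  if (/^([a-z][a-z0-9+.-]*:|\/|\.\.\/|#)/i.test(href)) return href;
  return prefix + href;
}

const prefix = prefixFromSsFolder("../../../staticSearch");
const fixed = fixResultHref("html/transcripts/NO/1r.html", prefix);
console.log(fixed); // ../../../html/transcripts/NO/1r.html

// Check both forms against standard URL resolution from the nested AUT page.
const page = "https://www.inklesseditions.com/teseida/html/transcripts/AUT/1r.html";
console.log(new URL("html/transcripts/NO/1r.html", page).href);
// https://www.inklesseditions.com/teseida/html/transcripts/AUT/html/transcripts/NO/1r.html
console.log(new URL(fixed, page).href);
// https://www.inklesseditions.com/teseida/html/transcripts/NO/1r.html

// In the browser this could run inside searchFinishedHook, for example:
//   document.querySelectorAll("#searchContainer a[href]").forEach((a) =>
//     a.setAttribute("href", fixResultHref(a.getAttribute("href"), prefix)));
```

The first resolved address reproduces the broken link described above; the second is the address we want.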


Sunday 23 July 2023

Setting up staticSearch for our project: integrating the search box into our pages

 In earlier posts, I describe the background to the decision to use staticSearch and my experience of getting it to work. In this post, I describe how we are winding staticSearch into our editions.

By default, staticSearch places everything it uses into a <div id="staticSearch"> element. So your core search page, typically the "index.html" file at the root of your document collection, has to contain a <div id="staticSearch"> </div> element. When you run the Ant process, as described in https://scholarlydigitaleditions.blogspot.com/2023/07/staticsearch-and-me.html, the <div id="staticSearch"> gets populated with multiple javascript and html statements. Here is the beginning of what staticSearch pastes in, as of version 1.4.4:

<div id="staticSearch"> 
<script xmlns="http://www.w3.org/1999/xhtml" src="staticSearch/ssSearch-debug.js"></script> 
<script xmlns="http://www.w3.org/1999/xhtml" src="staticSearch/ssInitialize.js"></script>

<noscript xmlns="http://www.w3.org/1999/xhtml">This page requires JavaScript.</noscript>

<form xmlns="http://www.w3.org/1999/xhtml" accept-charset="UTF-8" id="ssForm"
      data-allowphrasal="yes" data-allowwildcards="yes" data-minwordlength="2"
      data-scrolltotextfragment="no" data-maxkwicstoshow="5" data-resultsperpage="5"
      onsubmit="return false;" data-versionstring="" data-ssfolder="../../../staticSearch"
      data-kwictruncatestring="..." data-resultslimit="2000">
  <span class="ssQueryAndButton">
    <input type="text" id="ssQuery" aria-label="Search"/>
    <button id="ssDoSearch">Search</button>
  </span>
</form>

This fragment sets up the search form, which will appear wherever you have put <div id="staticSearch"> in your document. In our implementation, we place it in the document header, where we want this to take up very little space. In the default implementation, the <form id="ssForm"> is followed by two other elements, thus:

<div xmlns="http://www.w3.org/1999/xhtml" id="ssSearching" >Searching...</div>

<div xmlns="http://www.w3.org/1999/xhtml" id="ssResults"></div>

<div xmlns="http://www.w3.org/1999/xhtml" id="ssPoweredBy"> //ssLogo etc

If we keep this as is, here is what the top of our page looks like:


This is rather ugly, as the "ssSearching" and "ssPoweredBy" elements intrude on our very clear header. So we can suppress those by adding 'style="display:none"' to those elements. We will also add 'style="display:none"' to the "ssResults" element: more on that in a moment. Thus:

 <div xmlns="http://www.w3.org/1999/xhtml" id="ssSearching" style="display:none">Searching...</div>

Now, this is how it looks once those elements have been hidden:


This could be better yet. Space in the header is at a premium: every letter counts. Instead of "Search", taking up rather a large box, we might have just a "?" or, even better, a search icon. So we adjust the appearance of the "Search" button with this bit of css:

  #ssDoSearch {
       background-image: url("../../common/images/searchicon.png");
       background-size: 15px;
       height: 24px;
       width: 23px;
       background-position: 50%;
       background-repeat: no-repeat;
       position: relative;
       top: 5px;
   }
 #ssQuery {
      width: 100px;
  }

The #ssQuery rule also makes the search box a bit narrower. So now it looks like this:

This is beginning to look fine.  Now, let's see what the search results look like in the next post.





Thursday 20 July 2023

staticSearch and Me: getting started

 In another post, I explain the background to my work in making digital scholarly editions in relation to the Endings project, and how this led me to staticSearch. In this post, I describe my use of staticSearch, in hope that it might help others who, like me, want to include a search engine in their online resource. I am using a Macintosh MacBook Pro, running Ventura 13.4 in July 2023 as I write this.

The documentation for staticSearch is at https://endings.uvic.ca/staticSearch/docs/index.html. The first step is to download the source code at https://github.com/projectEndings/staticSearch/releases/. This should arrive on your computer as a zip file named staticSearch-1.4.4.zip, or similar. Just double-click to unpack the zip file into a folder named staticSearch-1.4.4. Move that folder somewhere convenient from the downloads folder: easiest and simplest to put it all in your Applications folder.

Before you can do anything more: you need Apache Ant. This is a tool designed to build complex software projects from source. Ant will read a set of instructions: get this file! rebuild it this way! save it to this file! now get another software process to use that file to convert other files into something else! and then create new objects (new files, new tools, new libraries) from those files! etc. etc. Section 7.7 of the staticSearch documentation says laconically:

    Note: you will need Java and Apache Ant installed, as well as ant-contrib.

You should have Java already, in an up-to-date distribution, as part of your computer. But you may need to get Apache Ant. You get it from https://ant.apache.org/srcdownload.cgi. Look for the latest version: in July 2023, this was 1.10.13 (the download page also lists 1.9.16, an older release line which requires only Java 5). Version 1.10.13 requires Java 8 or later, which you should already have. Download the zip file, double-click to unpack it to a folder named apache-ant-1.10.13 (or similar). As before, move that folder into your Applications folder.

You also have to get ant-contrib. This is a little more complex. What you actually need are two Java .jar files, named "cpptasks-1.0b5.jar" and "ant-contrib-1.0.jar". It took me a while to figure this out. The Apache ant-contrib page gives you the source for cpptasks, and someone with more expertise and time than me could (I suppose) compile the source into a Java .jar. But I took the short-cut and found a copy of cpptasks-1.0b5.jar out there on the net (in my case, at https://jar-download.com/artifacts/ant-contrib/cpptasks/1.0b5#google_vignette). I found ant-contrib-1.0.jar at http://www.java2s.com/Code/Jar/a/Downloadantcontrib10jar.htm.

Once you have these .jar files: place them both in the lib directory of your Apache Ant folder. 

You are now ready to test out staticSearch. Here's what you do:

  1. Open the Macintosh terminal application. You will find this in your Applications/Utilities folder. This is a good old-fashioned command-prompt system, like we all used back in the 80s (remember the 80s? Wham? Freddie Mercury? yes, those). 
  2. In the terminal: move into your static search folder. If you have unpacked it into Applications as "staticSearch-1.4.4" you should type "cd /Applications/staticSearch-1.4.4" into the terminal
  3. Now you are ready to test that all is working. For this you have to run Ant. You do this with the following command at the terminal: "/Applications/apache-ant-1.10.13/bin/./ant" (assuming you have got Ant in a directory named apache-ant-1.10.13 inside Applications). If all is installed correctly you should see a lot of things on the screen and, finally, a triumphant "BUILD SUCCESSFUL" message comes up. (If you are smarter than I am you might be able to edit the $PATH statement in your terminal profile so that you just need to type "ant" into the terminal, and not "/Applications/apache-ant-1.10.13/bin/./ant". It seems Apple do not want you to edit your terminal profile, and are making this rather difficult: see https://stackoverflow.com/questions/9832770/where-is-the-default-terminal-path-located-on-mac.)
Now, try it with your own HTML. The staticSearch documentation is excellent. I created a folder called "mystuff" inside the staticSearch folder. In this folder I put all my html, itself in another folder called "html". I had an index.html file in the root of the mystuff folder and I had an xml file called "ssconfig.xml" containing the key instructions directing staticSearch to work on my html:

<config xmlns="http://hcmc.uvic.ca/ns/staticSearch">
 <params>
<searchFile>index.html</searchFile>
<recurse>true</recurse>
 </params>
</config>

I now ran staticSearch on my material with this command: 

/Applications/apache-ant-1.10.13/bin/./ant -DssConfigFile=/Applications/staticSearch-1.4.4/mystuff/ssconfig.xml
(I could also have used just "/Applications/apache-ant-1.10.13/bin/./ant -DssConfigFile=mystuff/ssconfig.xml" as I am already in the staticSearch folder)

The first time I tried this, it did not work. It turns out that the <params> declaration needs a whole lot more in it or you get a failed build. <params> needs to contain declarations as follows:
        <phrasalSearch>true</phrasalSearch>
        <wildcardSearch>true</wildcardSearch>
        <createContexts>true</createContexts>
        <resultsPerPage>5</resultsPerPage>
        <minWordLength>2</minWordLength>
        <maxKwicsToHarvest>5</maxKwicsToHarvest>
        <maxKwicsToShow>5</maxKwicsToShow>
        <totalKwicLength>15</totalKwicLength>
        <kwicTruncateString>...</kwicTruncateString>
        <verbose>false</verbose>
        <stopwordsFile>test_stopwords.txt</stopwordsFile>
        <dictionaryFile>english_words.txt</dictionaryFile>
        <indentJSON>true</indentJSON>
It turns out that this issue is a part of a wider discussion in the SS community on what needs to be declared in the set-up, and what can be set as defaults. See the discussion in the comments on https://github.com/projectEndings/staticSearch/issues/270, where I first reported my experience, and on https://github.com/projectEndings/staticSearch/issues/195, where the wider discussion takes place.
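To make the complete file concrete, here is a small sketch that assembles an ssconfig.xml from the settings quoted in this post. The script and its names ("params", "escapeXml", "buildSsConfig") are my own illustration, not part of staticSearch; the element order simply follows the listings above, and I have not checked whether other staticSearch versions require a different set or order.

```javascript
// Settings quoted in this post, in the order they appear above.
const params = [
  ["searchFile", "index.html"],
  ["recurse", "true"],
  ["phrasalSearch", "true"],
  ["wildcardSearch", "true"],
  ["createContexts", "true"],
  ["resultsPerPage", "5"],
  ["minWordLength", "2"],
  ["maxKwicsToHarvest", "5"],
  ["maxKwicsToShow", "5"],
  ["totalKwicLength", "15"],
  ["kwicTruncateString", "..."],
  ["verbose", "false"],
  ["stopwordsFile", "test_stopwords.txt"],
  ["dictionaryFile", "english_words.txt"],
  ["indentJSON", "true"],
];

// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap the settings in the <config>/<params> structure shown earlier.
function buildSsConfig(entries) {
  const lines = entries.map(
    ([name, value]) => `    <${name}>${escapeXml(value)}</${name}>`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<config xmlns="http://hcmc.uvic.ca/ns/staticSearch">',
    "  <params>",
    ...lines,
    "  </params>",
    "</config>",
    "",
  ].join("\n");
}

const configXml = buildSsConfig(params);
console.log(configXml);
```

The printed text can be saved as mystuff/ssconfig.xml and passed to Ant with the -DssConfigFile option shown above.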

Now that I had staticSearch running: the next step was to start integrating it into our own HTML. That's the subject of the next post.