Sunday 30 July 2023

Setting up staticSearch for our projects: nested files and multiple search entry points

After looking (a week later!) at the staticSearch projects listed in the documentation, I now realize that our projects differ, it seems, from all other staticSearch implementations to date in two respects I had not known about:

  1. staticSearch assumes (or at least, all the listed projects appear to follow this model) that all the pages to be searched are held in the same folder as the root index.html file. Indeed, a 2019 presentation by the staticSearch team explicitly declares that "All pages live together in the same folder" and, furthermore, "We don't care" if that means there are 10,591 files in that one folder.
  2. staticSearch assumes (or at least, all the listed projects appear to follow this model) that all searches are launched from a single place, and a single file, contained in that same folder holding all the project files.
Neither of these assumptions holds good for our projects. I anticipate that the Canterbury Tales Project when complete (!) will require somewhere around 90,000 distinct html files: one for each of the 29,000 manuscript pages in which the Tales occur; three files for each of the some 20,000 entities (lines of poetry, blocks of prose) which constitute the text of the Tales. I, for one, am not comfortable with around 90,000 files in a single folder. We devised a uniform directory structure to hold all these files. The transcript of folio 1r in Hengwrt is held in "html/transcripts/Hg/1r.html"; the collation of the first line of the General Prologue is held in "html/collations/GP/1.html". By design, then, all our html files are buried four layers below the "home" folder holding our index.html file.

In fact, we discovered that the '<recurse>true</recurse>' statement in the configuration files means that staticSearch has no problem at all with nested directories. It duly finds and indexes all our html pages. But the second issue -- that the default staticSearch configuration expects that all searches will be run from a single file, located in the project home directory -- does cause problems. We could, quite easily, have set up our projects the way staticSearch expects, so that clicking on a "search" icon or similar on each of the 90,000 pages would send the reader to a single search page, presumably in the home directory. But we did not want to do that. Here is how the header for one of our project pages looks (for folio 72r of the Naples manuscript of the forthcoming Agostinelli/Coleman edition of Boccaccio's Teseida):

A fundamental principle of our edition design is "have only the pages you really need". We want our readers to be able to run the search directly from the page they are looking at, and not have to go to any other page to do the search. Further, we want the header on all our pages to look the same, following another mantra: "keep everything as uniform as possible across the whole edition". This meant that every one of our (possibly) 90,000 html pages would have a search box on it, as you can see in the top right of this image. It also meant that searches would not always begin from a file located at the root of the project folder. Indeed, all searches except those run from the index.html starting point to our editions would begin from a file nested four layers deep in the project folder. And that is why we found the problems with folder paths referred to in the previous post.

I will post a suggestion in the staticSearch issues forum as to how staticSearch itself could help projects configured like ours, with many files spread over multiple folders and each file being a search access point. In a final post in this series, I offer some general thoughts about staticSearch.

Monday 24 July 2023

Setting up staticSearch for our project: the search results

 In the last post, we integrated the search page into the header of our pages. Now, we need to deal with the search results.  staticSearch by default places the search results in the <div id="ssResults"> element. But as part of our set up, we hid that element. Instead, we want the search results to appear in a different place on the page: in a <div id="searchContainer">. So how do we do this?

staticSearch has anticipated that users might want to intervene at the end of the search process to adjust how and where the search results appear. You can find this out by digging into the ssSearch-debug.js file which staticSearch helpfully makes available (it is in the staticSearch folder which Ant makes in your project folder). In it you will find references to a "searchFinishedHook" function, created explicitly as a hook where developers such as me can get at the results of the search and manipulate them before they are seen by the user. The definition of searchFinishedHook is left open in the staticSearch initialization:           

       this.searchFinishedHook = function(num){};

Accordingly, we can redefine searchFinishedHook to let us do what we want to the search results. In this piece of code in the ssInitialize.js file in the staticSearch folder, we define our own searchFinishedHook function:

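     window.addEventListener('load', function() {
          // Create the search object once the page has loaded
          Sch = new StaticSearch();
          // staticSearch calls this hook when a search has finished
          Sch.searchFinishedHook = function (num) {
               $("#splash").hide();
               $("#rTable").hide();
               // Copy the results out of the hidden ssResults element
               $("#searchContainer").html($("#ssResults").html());
               $("#searchContainer").show();
          };
     });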

The first two lines of the hook hide the "splash" and "rTable" elements on the page. The next line copies all the search results from the hidden "ssResults" element into the "searchContainer" element, and the last line shows that element. Here is what it looks like for a simple search:


For a first, straight out-of-the-box effort, this is really impressive! And also lightning fast. You can see too that staticSearch has set us up with links to each page. In our project, we have the base index.html at the root of our project folder. All the files with transcripts of each page are in a folder labelled "html", with a subfolder "transcripts", a subfolder for each manuscript, and finally the html for each page. Thus the transcript for folio 1r of manuscript AUT is in html/transcripts/AUT/1r.html.

staticSearch knows that each page it indexes is in the folder "html/transcripts/..." relative to the index.html file. Accordingly, staticSearch creates a link to each page as (for example) href="html/transcripts/AUT/1r.html". These links to files work fine for searches from our index.html file. Following standard web protocols, the path "html/transcripts/AUT/1r.html" is appended to the path to the index.html file (e.g. "https://www.inklesseditions.com/teseida/"), thus becoming "https://www.inklesseditions.com/teseida/html/transcripts/AUT/1r.html".

However, we want to do searches not just from the root index.html file, but also from every transcript file. That is: the calling file for each search is NOT at the root directory, where the index.html file is, but in a directory nested several layers deep within the root. For example, a search from the transcript file for page 1r of the AUT manuscript will be sent from html/transcripts/AUT/1r.html. This affects all the file calls needed and created for staticSearch. Accordingly, we have to make multiple adjustments within the file 1r.html to prepare it for staticSearch:

        The calls to ssInitialize.js and ssSearch.js have to go to ../../../staticSearch/ssInitialize.js and
          ../../../staticSearch/ssSearch.js (the same for ssSearch-debug.js if you are using that instead)

         The attribute value @data-ssfolder on the form id="ssForm" which runs the search has to be set 
          to ../../../staticSearch and not just staticSearch
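The "../../../" prefix in both adjustments follows directly from how many folders the page sits below the project root. As a minimal sketch only (the function name rootPrefix is my own illustration, not part of staticSearch), the prefix could be worked out from a page's path relative to the root like this:

```javascript
// Hypothetical helper, not part of staticSearch: given a page's path relative
// to the project root, return the "../" prefix that leads back to the root.
function rootPrefix(pagePath) {
  // Each folder in the path needs one "../"; the file name itself does not.
  const folders = pagePath.split('/').filter(Boolean).length - 1;
  return '../'.repeat(Math.max(folders, 0));
}

// A transcript page three folders down from the root:
const prefix = rootPrefix('html/transcripts/AUT/1r.html');
console.log(prefix + 'staticSearch');                   // ../../../staticSearch
console.log(prefix + 'staticSearch/ssSearch.js');       // ../../../staticSearch/ssSearch.js

// The root index.html needs no prefix at all:
console.log(rootPrefix('index.html') + 'staticSearch'); // staticSearch
```

Something like this could be used when generating the pages, so that the script calls and the @data-ssfolder value are always written correctly for the folder each page lives in.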

With these adjustments, the search from 1r.html runs just fine. But we have a problem with the links to the pages in the search results. It is this: because staticSearch does not understand that the 1r.html file is buried in html/transcripts/AUT/ it fails to create valid links in the search results to the files containing the search hits. For example: the link from html/transcripts/AUT/1r.html to 1r.html of the NO manuscript should be html/transcripts/NO/1r.html.  Instead, StaticSearch links to         
           html/transcripts/AUT/html/transcripts/NO/1r.html

That is: the file path html/transcripts/NO/1r.html ends up concatenated with the path for html/transcripts/AUT/, because a relative link is always resolved against the folder of the page it appears on. The link should actually be '../../../html/transcripts/NO/1r.html'.
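You can check this behaviour without staticSearch at all. The sketch below uses the standard URL object (built into browsers and Node.js) to resolve the same relative link against different calling pages; the helper name resolve is just mine, for illustration:

```javascript
// URL applies the standard rules for resolving a relative link against the
// address of the page that contains it.
const root = 'https://www.inklesseditions.com/teseida/';
const nestedPage = root + 'html/transcripts/AUT/1r.html';
const hit = 'html/transcripts/NO/1r.html';

// Illustrative helper: where does `link` lead when it appears on `page`?
function resolve(link, page) {
  return new URL(link, page).href;
}

// From index.html at the root, the link is fine:
console.log(resolve(hit, root + 'index.html'));
// https://www.inklesseditions.com/teseida/html/transcripts/NO/1r.html

// From the nested transcript page, the same link goes astray:
console.log(resolve(hit, nestedPage));
// https://www.inklesseditions.com/teseida/html/transcripts/AUT/html/transcripts/NO/1r.html

// With the "../../../" prefix, it is correct again:
console.log(resolve('../../../' + hit, nestedPage));
// https://www.inklesseditions.com/teseida/html/transcripts/NO/1r.html
```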

It is not too difficult to fix these incorrect paths by calling a function to rewrite these internal links in our override of the searchFinishedHook  function. But this is somewhat ugly. A better solution would be to have staticSearch recognize where it is functioning from in the file-system and adjust links accordingly in the Ant process. In the next post, I explore how these problems have arisen and what might be done about it.
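For what it is worth, here is a rough sketch of what such a link-rewriting function might look like. The names fixHref and fixResultLinks are my own, not anything staticSearch provides, and I am assuming that the result links are ordinary <a href="..."> elements with paths relative to the project root, as in the examples above:

```javascript
// Hypothetical helper: prepend the path back to the project root to a
// root-relative result link, leaving other kinds of link untouched.
function fixHref(href, prefix) {
  // Skip absolute URLs (http:, mailto: and so on), server-absolute paths,
  // links that already climb upwards, and in-page fragments.
  if (/^(?:[a-z][a-z0-9+.-]*:|\/|\.\.\/|#)/i.test(href)) {
    return href;
  }
  return prefix + href;
}

// Rewrite every link inside a container element (browser DOM assumed).
// getAttribute is used so that we see the href as written, not as resolved.
function fixResultLinks(container, prefix) {
  container.querySelectorAll('a[href]').forEach(function (a) {
    a.setAttribute('href', fixHref(a.getAttribute('href'), prefix));
  });
}

// Possible use inside our hook, on a page three folders below the root:
//   Sch.searchFinishedHook = function (num) {
//       fixResultLinks(document.getElementById('ssResults'), '../../../');
//       $("#searchContainer").html($("#ssResults").html());
//       $("#searchContainer").show();
//   };
```

Because links that already begin with "../" are left alone, running the function twice over the same results should do no harm.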


Sunday 23 July 2023

Setting up staticSearch for our project: integrating the search box into our pages

 In earlier posts, I describe the background to the decision to use staticSearch and my experience of getting it to work. In this post, I describe how we are winding staticSearch into our editions.

By default, staticSearch places everything it uses into a <div id="staticSearch"> element. So your core search page, typically the "index.html" file at the root of your document collection, has to contain a <div id="staticSearch"> </div> element. When you run the Ant process, as described in https://scholarlydigitaleditions.blogspot.com/2023/07/staticsearch-and-me.html, the <div id="staticSearch"> gets populated with multiple javascript and html statements. Here is the beginning of what staticSearch pastes in, as of version 1.4.4:

<div id="staticSearch">
<script xmlns="http://www.w3.org/1999/xhtml" src="staticSearch/ssSearch-debug.js"></script>
<script xmlns="http://www.w3.org/1999/xhtml" src="staticSearch/ssInitialize.js"></script>
<noscript xmlns="http://www.w3.org/1999/xhtml">This page requires JavaScript.</noscript>
<form xmlns="http://www.w3.org/1999/xhtml" accept-charset="UTF-8" id="ssForm"
      data-allowphrasal="yes" data-allowwildcards="yes" data-minwordlength="2"
      data-scrolltotextfragment="no" data-maxkwicstoshow="5" data-resultsperpage="5"
      onsubmit="return false;" data-versionstring="" data-ssfolder="../../../staticSearch"
      data-kwictruncatestring="..." data-resultslimit="2000">
<span class="ssQueryAndButton">
<input type="text" id="ssQuery" aria-label="Search"/>
<button id="ssDoSearch">Search</button>
</span>
</form>

This fragment sets up the search form, which will appear wherever you have put <div id="staticSearch"> in your document. In our implementation, we place it in the document header, where we want it to take up very little space. In the default implementation, the <form id="ssForm"> is followed by three other elements, thus:

<div xmlns="http://www.w3.org/1999/xhtml" id="ssSearching" >Searching...</div>

<div xmlns="http://www.w3.org/1999/xhtml" id="ssResults"></div>

<div xmlns="http://www.w3.org/1999/xhtml" id="ssPoweredBy"> //ssLogo etc

If we keep this as is, here is what the top of our page looks like:


This is rather ugly, as the "ssSearching" and "ssPoweredBy" elements intrude on our very clear header. So we can suppress those by adding 'style="display:none"' to those elements. We will also add 'style="display:none"' to the "ssResults" element: more on that in a moment. Thus:

<div xmlns="http://www.w3.org/1999/xhtml" id="ssSearching" style="display:none">Searching...</div>

Now, this is how it looks once those elements have been hidden:


This could be better yet. Space in the header is at a premium: every letter counts. Instead of "Search", taking up rather a large box, we might have just a "?" or, even better, a search icon. So we adjust the appearance of the "Search" button with this bit of css:

  #ssDoSearch {
       background-image: url("../../common/images/searchicon.png");
       background-size: 15px;
       height: 24px;
       width: 23px;
       background-position: 50%;
       background-repeat: no-repeat;
       position: relative;
       top: 5px;
  }
  #ssQuery {
       width: 100px;
  }

The #ssQuery rule also makes the search box a bit narrower. So now it looks like this:

This is beginning to look fine.  Now, let's see what the search results look like in the next post.





Thursday 20 July 2023

staticSearch and Me: getting started

In another post, I explain the background to my work in making digital scholarly editions in relation to the Endings project, and how this led me to staticSearch. In this post, I describe my use of staticSearch, in the hope that it might help others who, like me, want to include a search engine in their online resource. I am using a MacBook Pro, running Ventura 13.4 in July 2023 as I write this.

The documentation for staticSearch is at https://endings.uvic.ca/staticSearch/docs/index.html. The first step is to download the source code at https://github.com/projectEndings/staticSearch/releases/. This should arrive on your computer as a zip file named staticSearch-1.4.4.zip, or similar. Just double-click to unpack the zip file into a folder named staticSearch-1.4.4. Move that folder somewhere convenient from the downloads folder: easiest and simplest to put it all in your Applications folder.

Before you can do anything more: you need Apache Ant. This is a tool designed to build complex software projects from source. Ant will read a set of instructions: get this file! rebuild it this way! save it to this file! now get another software process to use that file to convert other files into something else! and then create new objects (new files, new tools, new libraries) from those files! etc. etc. Section 7.7 of the staticSearch documentation says laconically:

    Note: you will need Java and Apache Ant installed, as well as ant-contrib.

You should have Java already, in an up-to-date distribution, as part of your computer. But you may need to get Apache Ant. You get it from https://ant.apache.org/srcdownload.cgi. Look for the latest version: in July 2023, this was 1.10.13. This requires Java 8 or later, which you should already have. Download the zip file, double-click to unpack it to a folder named apache-ant-1.10.13 (or similar). As before, move that folder into your Applications folder.

You also have to get ant-contrib. This is a little more complex. What you actually need are two Java .jar files, named "cpptasks-1.0b5.jar" and "ant-contrib-1.0.jar". It took me a while to figure this out. The Apache ant-contrib page gives you the source for cpptasks, and someone with more expertise and time than me could (I suppose) compile the source into a Java .jar. But I took the short-cut and found a copy of cpptasks-1.0b5.jar out there on the net (in my case, at https://jar-download.com/artifacts/ant-contrib/cpptasks/1.0b5#google_vignette). I found ant-contrib-1.0.jar at http://www.java2s.com/Code/Jar/a/Downloadantcontrib10jar.htm.

Once you have these .jar files: place them both in the lib directory of your Apache Ant folder. 

You are now ready to test out staticSearch. Here's what you do:

  1. Open the Macintosh terminal application. You will find this in your Applications/Utilities folder. This is a good old-fashioned command-prompt system, like we all used back in the 80s (remember the 80s? Wham? Freddie Mercury? yes, those).
  2. In the terminal: move into your staticSearch folder. If you have unpacked it into Applications as "staticSearch-1.4.4" you should type "cd /Applications/staticSearch-1.4.4" into the terminal.
  3. Now you are ready to test that all is working. For this you have to run Ant. You do this with the following command at the terminal: "/Applications/apache-ant-1.10.13/bin/./ant" (assuming you have got Ant in a directory named apache-ant-1.10.13 inside Applications). If all is installed correctly you should see a lot of things on the screen and, finally, a triumphant "BUILD SUCCESSFUL" message comes up. (If you are smarter than I am you might be able to edit the $PATH statement in your terminal profile so that you just need to type "ant" into the terminal, and not "/Applications/apache-ant-1.10.13/bin/./ant". It seems Apple do not want you to edit your terminal profile, and are making this rather difficult: see https://stackoverflow.com/questions/9832770/where-is-the-default-terminal-path-located-on-mac.)

Now, try it with your own HTML. The staticSearch documentation is excellent. I created a folder called "mystuff" inside the staticSearch folder. In this folder I put all my html, itself in another folder called "html". I had an index.html file in the root of the mystuff folder and I had an xml file called "ssconfig.xml" containing the key instructions directing staticSearch to work on my html:

<config xmlns="http://hcmc.uvic.ca/ns/staticSearch">
    <params>
        <searchFile>index.html</searchFile>
        <recurse>true</recurse>
    </params>
</config>

I now ran staticSearch on my material with this command: 

/Applications/apache-ant-1.10.13/bin/./ant -DssConfigFile=/Applications/staticSearch-1.4.4/mystuff/ssconfig.xml
(I could also have used just "/Applications/apache-ant-1.10.13/bin/./ant -DssConfigFile=mystuff/ssconfig.xml" as I am already in the staticSearch folder)

The first time I tried this, it did not work. It turns out that the <params> declaration needs a whole lot more in it or you get a failed build. <params> needs to contain declarations as follows:
        <phrasalSearch>true</phrasalSearch>
        <wildcardSearch>true</wildcardSearch>
        <createContexts>true</createContexts>
        <resultsPerPage>5</resultsPerPage>
        <minWordLength>2</minWordLength>
        <maxKwicsToHarvest>5</maxKwicsToHarvest>
        <maxKwicsToShow>5</maxKwicsToShow>
        <totalKwicLength>15</totalKwicLength>
        <kwicTruncateString>...</kwicTruncateString>
        <verbose>false</verbose>
        <stopwordsFile>test_stopwords.txt</stopwordsFile>
        <dictionaryFile>english_words.txt</dictionaryFile>
        <indentJSON>true</indentJSON>
It turns out that this issue is a part of a wider discussion in the SS community on what needs to be declared in the set-up, and what can be set as defaults. See the discussion in the comments on https://github.com/projectEndings/staticSearch/issues/270, where I first reported my experience, and on https://github.com/projectEndings/staticSearch/issues/195, where the wider discussion takes place.

Now that I had staticSearch running: the next step was to start integrating it into our own HTML. That's the subject of the next post.




The Endings Project and the Canterbury Tales Project (and also, Boccaccio and Dante)

At last, after many years, we (ie, me and a few other people) are getting ready to unleash on the world a whole series of digital scholarly editions. We have already released the second edition of Prue Shaw's Commedia, now at www.dantecommedia.it. We are now contemplating a third edition of that. Soon to come is Bill Coleman and Edvige Agostinelli's edition of Boccaccio's Teseida. And then the really big one: the first tranches of the Critical Edition of the Canterbury Tales Based on All Known Pre-1500 Witnesses, with myself and Barbara Bordalejo as General Editors. All of these will appear in the next twelve months.

Why so long? We (as before) have been working on all these since the 1990s (the Dante and Chaucer) and 2000s (Boccaccio). There are multiple reasons. For this post, one reason is specially important: we wanted to be sure the edition could survive the chances of online time. It should stand alone, for decades and even centuries to come, as surely as a print edition might survive upon a library shelf. How could we achieve this, given all the shifting currents of the digital world?

We were not the only people worrying about this. From 2016 a five-year SSHRC grant (Canada) funded the Endings project. This project took as its starting point a number of digital projects based at the University of Victoria which faced exactly the same issue we had: how can these projects be given the best chance of survival long into the future? In fact, I did not come across the Endings project until a long way into the making of Shaw's second Commedia edition. By this time I had already reached identical (or nearly so) conclusions as the Endings project, as follows:

1. While our development of these editions had used custom database technologies to present and edit all project data, our published editions would not use databases or any related "server-side" technology at all: no databases, no PHP, no python, nothing. That is: everything would be contained on one server with no outside dependencies at all so far as our texts are concerned.

2. Our presentation of the texts would rely solely on the core web technologies of HTML5, css and javascript. Nothing else.

3. Any departures from these principles for any part of our edition (for example: the use of external JavaScript libraries; the use of IIIF image viewers) would use widely-used open source tools.

These principles correspond to the Endings project principles 4.1, 4.2 and 4.9. In some areas, however, our practice differs from that of the Endings project. For example, we do use the jQuery library, which in my view has now achieved core web technology status. I think the same is becoming true of the IIIF family. However, I do not think the same is true of XML technologies (nor, interestingly, do the Endings people) and we do not use XSLT, etc, as any part of our final publication model. We also use query strings, which again seem to me a core web technology, where Endings does not. Nor do we aim for "graceful failure" where css/javascript/something else does not work. It seems to me that providing all source data within the edition, permitting others to fashion new interfaces to our data, is the best way of anticipating any failure.

One might object: we are making a bet on certain core technologies now still being core technologies centuries in the future. Yes we are. But we see this bet as being in the same category as the bet scholars have made for millennia: that there will be a library or other place somewhere in the future which has a shelf for my book.

Another principle of the Endings project is that it will not use an external service to provide functionality, and specifically names Google Search as such a service. In my early preparations for the Shaw edition, I had investigated using Google Search to provide a search tool. Indeed, the second edition at www.dantecommedia.it implements searching in exactly this way. You can see from just a cursory use of Google Search in the second edition how unsatisfactory it is. Searching for "come", one of the most common words in the Commedia, gives just one result; "tanto" yields none at all. Many search results begin with advertisements, for holidays, or beer. I spent many hours trying to get Google Search to do better, including feeding it hard-wired urls to every page of transcription. Nothing seemed to work. It appears the Google algorithms rebel when faced with nine near-identical texts, and fail over and over to return anything like meaningful results. 

For these reasons I was contemplating just how a stand-alone search system might be implemented, when I came across the Endings project, and StaticSearch. They had done it! and it worked! On another page, I describe my experiences of StaticSearch.