The article which follows was posted on the Scholarly Digital Editions blog in a series of entries from February to June 2007. There were two contexts for these original blog posts:
Now, the original blog posts, as posted between 12.58 pm GST, 5 February 2007 and 10.02 pm, 28 June 2007. I follow this with an email I sent to Fotis, Joris and others announcing these posts. Among other matters: the post of June 21 announces the new name: CollateX. Thus it has been since.
February 5, 2007
The design of CollateXML
Filed under: Designing Collate — Peter @ 12:58
pm
In this document I set out, as clearly as I
can, the various datastructures and operations which I think CollateXML will
require.
The fundamental design of CollateXML is this:
1. The input is various streams of text, divided into marked collation blocks
2. These various streams of text are located
3. Within the streams of text, each corresponding block for collation must be located
4. The collation program creates two sets of collation information:
   1. concerning the different orderings of the blocks within the streams of text
   2. concerning the differences in the texts contained in the blocks themselves
5. The collation information is then formatted for output
A few observations:
- In Collate0-2,
the input was always computer files, held on the computer itself. In
CollateXML, the input could be ‘any text, anywhere’: from a database, local or remote;
from a URL anywhere.
- The crucial
marking of collation blocks should be done through something like the
‘universal text identifier’ scheme I outlined at The Hague on 25 January 2007.
- Collate0-2 did
only ‘word by word’ collation. This presumes that the texts are ‘word by word’
collatable: without very large areas of added, deleted, or transposed text. But
many texts have a different kind of relation: large portions of one text might
be embedded in another text, but other areas of the texts are very different
(the situation common in plagiarism, or ‘intertextuality’, for example).
Collate0-2 did not handle this situation; CollateXML should be able to do so.
- CollateXML
should have its own internal data models for passing information both to and
from the collation process. These models should be exposed through an API to
programmers, who can then provide import and export for whatever formats they
choose.
We can now begin to specify the
building blocks we need.
February 5, 2007
How CollateXML should work
Filed
under: How CollateXML should work — Peter @ 1:18 pm
The separation of collation into stages
As Collate0-2 developed, I
learnt that one had to break the collation process into stages. At first,
Collate 0-2 simply collated, and identified the variants as it found them. I
soon learned that the complex requirements of scholarly collation demanded the
adjustment of the collation at various points. To do this, it became clear that
one had to separate out the stages of collation to permit intervention at various
points. However, this separation was grafted onto Collate0-2 in a piecemeal
fashion. I propose that from the beginning, CollateXML separate out what appear
to me now as the following fundamental stages of collation:
- text alignment, one witness at a time
against the base
- storage of alignment information for all
witnesses against the base
- adjustment of alignment information for
all witnesses against each other
- variant identification within the aligned
texts.
Text alignment
The fundamental building
block is the alignment routine itself. Here is how I suggest this works for
word by word collation, based on how it worked in Collate 0-2.
- Each alignment act compares two texts of a
text block at once: a specified ‘base’ text and a witness text, starting
at the first word of the text block in each
- The alignment examines the first word: if
they are identical it returns this information; if there is a variant it
returns that information along with the number of words matched in base
text and witness text. The possibilities are:
a. same word in each. 1 word matched in
each; next word to match in base will be word 2; next word to match in
witness will be word 2
b. one word replacing one word in each. next
word to match in base will be word 2; next word to match in witness will
be word 2
c. word omitted in witness. next word to
match in base will be 2; next word to match in witness will be 1
d. word added in witness. next word to match
in base will be 1; next word to match in witness will be 2
e. phrase omission or addition: as for c and
d, but the next word to be matched in base or witness will be adjusted
accordingly
f. phrase replacement: if two words in the
base are replaced by three words in the witness, then the next word in
the base to be collated for this witness will be word 3; the next word to
be collated in the witness will be word 4.
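The bookkeeping in these cases can be sketched in a few lines. This is only an illustration: the class `AlignmentStep`, the function `next_positions` and the case labels are hypothetical names, not anything taken from Collate 2, and the omission case is inferred by symmetry with the addition case.

```python
from dataclasses import dataclass


@dataclass
class AlignmentStep:
    kind: str              # hypothetical label for the kind of alignment found
    base_consumed: int     # words of the base matched by this step
    witness_consumed: int  # words of the witness matched by this step


def next_positions(base_pos, witness_pos, step):
    """Return the next word to match in base and witness (1-based)."""
    return base_pos + step.base_consumed, witness_pos + step.witness_consumed


CASES = {
    "same": AlignmentStep("same", 1, 1),
    "replacement": AlignmentStep("replacement", 1, 1),
    "omission": AlignmentStep("omission", 1, 0),    # word omitted in witness
    "addition": AlignmentStep("addition", 0, 1),    # word added in witness
    "phrase 2->3": AlignmentStep("phrase", 2, 3),   # two base words, three witness words
}

for name, step in CASES.items():
    print(name, next_positions(1, 1, step))
```

Starting from word 1 in both texts, the phrase replacement case gives word 3 in the base and word 4 in the witness, as in the description above.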
This is, fundamentally,
rather simple. You can look over the C code for Collate 2 to see how we did
this. Essentially, at each alignment, Collate 2 carried out a series of tests,
till it got a match:
- are the next words identical?
- are the next words a variant, and so align
against each other? Collate used a ‘fuzzy match’ algorithm for this.
Essentially, if the two words had more than 50% of their letters in common
(weighted according to the position of the letters) then Collate said,
these words align. Thus, Collate would see ‘cat’ and ‘mat’ as variants on
each other
- it could be that while this word does not
match, the next word does. So Collate will look at ‘black cat’ and ‘white
cat’ and declare that ‘black’ and ‘white’ align, because the next word is
a match. Indeed, Collate would look at ‘black cat’ and ‘white mat’, see
that mat/cat align because they satisfy the fuzzy match test, and so
declare black/white align
- If there is still no match: Collate tests
the second word in the base against the first word in the witness. If
they match: Collate concludes that the first word in the base is
omitted.
- Still no match: Collate tests the second
word in the witness against the first word in the base. If they match:
Collate concludes that the first word in the witness has been added. Now,
here is an important point: after establishing that the first word in the
witness has been added, Collate goes around again to collate the SECOND
word in the witness against the first word of the base, and reports a
SECOND variant at this point. For example: if the base has ‘mat’ and the
witness has ‘black cat’ Collate could report that ‘black’ has been added,
and that ‘cat’ is a variant on ‘mat’. See further
below on additions and omissions.
- Still no match: Collate guesses that maybe
the problem is word division. So it concatenates words in the base and the
witness, comparing as it does, to see if it can find a match
- Still no match: Collate starts searching
for phrase variants: addition/omission/replacement. In essence, it looks
further along the text, seeking to find sequences that match, with
everything up to the match a replacement/addition/omission. This is
probably the least sophisticated part of the current collate. Collate also
has a limit of 50 words for its look up: this might be lifted.
- After Collate has found the match on this
first word: it then looks to check if the NEXT match between this witness
and the base is an addition in this witness. The reason it needs to do
this is explained in the section on additions and omissions below.
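The first five tests in this series can be sketched as a cascade. This is a rough illustration under stated assumptions, not the Collate 2 code: `fuzzy` uses the ratio from Python's `difflib.SequenceMatcher` as a stand-in for Collate's position-weighted fuzzy match, and the function names and return labels are hypothetical. The word-division and phrase-search tests are left out.

```python
from difflib import SequenceMatcher


def fuzzy(a, b, threshold=0.5):
    """Stand-in for Collate's fuzzy match; NOT its position-weighted algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold


def matches(a, b):
    """True if two words are identical (ignoring case) or fuzzily alike."""
    return a.lower() == b.lower() or fuzzy(a, b)


def align_first(base, witness):
    """Classify the alignment at the first word of two lists of words."""
    b0, w0 = base[0], witness[0]
    if b0.lower() == w0.lower():
        return "identical"
    if fuzzy(b0, w0):
        return "variant"
    # This word does not match, but the next word in each text does.
    if len(base) > 1 and len(witness) > 1 and matches(base[1], witness[1]):
        return "variant-by-context"
    # Second word of the base against first word of the witness.
    if len(base) > 1 and matches(base[1], w0):
        return "omission"
    # Second word of the witness against first word of the base.
    if len(witness) > 1 and matches(witness[1], b0):
        return "addition"
    return "unresolved"


print(align_first(["cat"], ["mat"]))                      # variant
print(align_first(["black", "cat"], ["white", "mat"]))    # variant-by-context
print(align_first(["the", "cat"], ["cat"]))               # omission
print(align_first(["mat"], ["black", "cat"]))             # addition
```

The order of the tests matters: each later test runs only when all the earlier ones have failed, which is why ‘black’/‘white’ are aligned by context before any omission or addition is considered.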
After Collate has done this
for this word in the base against this witness: it goes on to do the same for
the next witness against this same base. As it identifies each alignment, it stores
the alignment information for each witness. When it has worked its way through
all the witnesses for this word, and has stored the alignment for each: it
proceeds through the next stages of adjusting the alignment and finally
identifying the actual variants.
After completing, as
described, the alignment for the first word of the base against all the
witnesses: Collate now goes on to align the second word of the base. Notice
particularly what happens when Collate discovers that it has already matched past
the next word in the witness: when, say, the first six words of the base have
been replaced by the first eight words of the witness. In that case, Collate
will skip over that witness until it is collating word 7 of the base: it will
then restart the collation by collating word 9 of the witness against that
word.
Alignment is NOT variant identification
I have spoken so far only
of text alignment, not variant identification. The difference is important.
Here is an example:
Base the black cat
A The black cat
B THE BLACK CAT
C The black cat
D The, black cat
For the purposes of text
alignment: the collation algorithm should ignore case differences, punctuation
tokens, and XML encoding around or within the words. Thus: it should identify
the first word of each one of the four witnesses as aligned against the first
word of the base. But note: it might be desirable to identify each or any one
of the four first words as having a variant at this point. This variant
identification is to be done at a later point. For now, all we have to do is
state that the first word in each of the four witnesses aligns against the
first word of the base.
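A minimal sketch of the kind of normalisation this implies for the alignment stage. The function name `alignment_key` and the `<hi>` tag in the example are hypothetical, and a real implementation would use a proper XML parser rather than a regular expression.

```python
import re


def alignment_key(token):
    """Reduce a token to the form compared during alignment only."""
    no_xml = re.sub(r"<[^>]+>", "", token)      # drop XML tags around or within the word
    no_punct = re.sub(r"[^\w\s]", "", no_xml)   # drop punctuation characters
    return no_punct.lower()                     # ignore case


for token in ["the", "The", "THE", "<hi>The</hi>", "The,"]:
    print(repr(token), "->", repr(alignment_key(token)))
```

All five tokens reduce to the same key, so all align; the original tokens are kept untouched for the later variant identification stage.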
Additions and omissions in Collate
This is a particularly
difficult area. I discovered in the course of writing Collate0-2 that different
people want all kinds of different things. Some people do not want to see
additions and omissions at all, but only replacements of shorter or longer
phrases by longer or shorter ones. When it is an addition and the scholar
wants this seen as a replacement of a shorter phrase by a longer one, some
scholars want to see the addition attached at the beginning, some at the end.
Take this text:
Base: a cat
witness: a black cat
Here are the possibilities:
a ] a; black added
cat ] cat (writing the addition with the PRECEDING word)
OR
a ] a
cat] cat; black added (writing the addition with the FOLLOWING word)
OR
a ] a black
cat ] cat (as phrase, addition with preceding word)
OR
a ] a
cat ] black cat (as phrase, addition with following)
OR
a ] a
… ] black
cat ] cat (this is actually the system used in Münster!)
Collate0-2 supports all
these possibilities, but does it in a rather inelegant way. Essentially, it
tries to adjust for these possibilities WHILE it collates. This is complex and
inflexible. Instead, I propose that CollateXML separate completely the
discovery of alignment, its storage and its expression. Collate0-2 almost does
this, but does not do it thoroughly. Broadly, I began with outputting the
variants as the program discovered them. Increasingly I found that one needed
to adjust the variation in various ways, and so moved towards separation of the
stages of alignment discovery, storage and variant identification. A major
benefit of this separation is that it permits adjustment of the variation at the
storage point: see next. However, Collate0-2 never quite managed a complete
movement to this separation. I propose that CollateXML has this separation at
its heart.
The storage of alignment information
So far, I have been describing how Collate discovers alignment. Essentially,
Collate discovers, for each word in the base text, for a particular witness,
exactly what alignment is present in a given witness at that point. The possibilities are:
- base and witness align at this word,
either because there is no variation (base and witness agree on this word)
or because base and witness vary at this word (either as: omission of this
word, or variation on the word)
- base and witness align and there is an
addition before and/or after this word (note: this includes the
possibility of an addition before the word, omission of the word, and
addition after the word)
- the word is the beginning of a phrase
alignment, with or without an addition before the phrase alignment (note:
this includes the possibility of phrase omission)
- the word is the ending of a phrase
alignment, with or without an addition after the phrase alignment
(including the possibility of phrase omission)
- the word falls within a phrase alignment
(for example: ‘the black cat’ replaced by ‘a white mouse’. When Collate
comes to collate ‘black’ in the base against this witness, it will find
that it falls within a phrase alignment and move on to the next word.)
It can be seen that
alignment is much complicated by the need to deal with ‘additions’. Again,
Collate0-2 never quite dealt with this as well as it needs to, and again I
propose to remedy this in CollateXML. I suggest that they
be dealt with as follows:
- The base text is seen as a series of
slots, corresponding to the words AND the space before the first word, between
each word, and after the last word
- The variants in each witness be aligned
against these slots. Thus: an addition before the first word is aligned
against the slot before the first word; a variant at the first word is
aligned against the first word; an addition between the first and second
words is aligned against the slot between these words, and so on.
One may illustrate this
with the base ‘black cat’ collated against the witness ‘the black and white
cat’. Numbering each ‘slot’ in the base from one, we have:
numbers | 1 | 2 | 3 | 4
base | | black | | cat
witness | the | black | and white | cat
Thus: the additions ‘the’
and ‘and white’ align against slots 1 and 3. In this system, even numbers are
used for words; odd numbers for the spaces between the words. (I am indebted to
the Institute for New Testament Research, Münster, for this numbering system,
and this conception of the base as a series of slots for both words and the
spaces between them).
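The slot numbering can be sketched as follows. The function name `slot_numbers` and the dictionary representation are assumptions made for illustration; the numbering itself (odd for gaps, even for words, starting at 1) follows the example above.

```python
def slot_numbers(base_words):
    """Map slot number -> base word, or None for a gap; numbering starts at 1."""
    slots = {1: None}                   # the gap before the first word
    for i, word in enumerate(base_words, start=1):
        slots[2 * i] = word             # even numbers hold the words
        slots[2 * i + 1] = None         # odd numbers hold the gap after each word
    return slots


base = slot_numbers(["black", "cat"])
print(base)  # {1: None, 2: 'black', 3: None, 4: 'cat', 5: None}

# The witness 'the black and white cat', keyed by the base slot it aligns against.
witness = {1: "the", 2: "black", 3: "and white", 4: "cat"}
additions = {n: text for n, text in witness.items() if n % 2 == 1}
print(additions)  # {1: 'the', 3: 'and white'}
```

Because additions always land on odd slots, they can be picked out by slot number alone, without comparing any text.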
The adjustment of variant information: a relatively simple case
The principal benefit of the separation of the stages of alignment discovery,
storage and output is that it permits adjustment of the variant alignments at
the storage stage and before the output.
Consider the case of the
following (rather fictitious) instance:
Base: The cat sat on the mat
Witness: The black sat on the mat
Left to itself, Collate0-2
will tell us that the variant is:
cat ] black
But in fact, what has
happened here is more correctly:
The ] The
.. ] black
cat ] omitted
That is: first ‘black’ is added,
then somehow ‘cat’ is omitted.
We may have many witnesses which read ‘The black cat’ where the base reads ‘The
cat’. In this case, at the storage stage, we should expect Collate to look over
the variants discovered in the other witnesses, find that in many others we
have ‘black’ added, and it should then adjust the stored variant information so
that instead of reading:
numbers | 1 | 2 | 3 | 4 | 5 | 6
base | | the | | cat | | sat
witness1 | | the | | black | | sat
witness2 | | the | black | cat | | sat
witness3 | | the | black | cat | | sat
the stored representation reads:
numbers | 1 | 2 | 3 | 4 | 5 | 6
base | | the | | cat | | sat
witness1 | | the | black | | | sat
witness2 | | the | black | cat | | sat
witness3 | | the | black | cat | | sat
That is: with ‘black’
matching against the space between ‘the’ and ‘cat’ (as it is in other
witnesses) rather than against ‘cat’.
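One way such an adjustment might work, as a sketch only. The function `adjust_replacement` and the slot-keyed dictionaries are hypothetical; the rule applied (re-file a one-word replacement when other witnesses attest the same word as an addition in the preceding gap) is just one possible heuristic.

```python
def adjust_replacement(witness, others, base):
    """Re-file a one-word replacement as addition + omission when other
    witnesses attest the same word as an addition in the preceding gap."""
    adjusted = dict(witness)
    for slot, word in witness.items():
        # Only look at word slots (even numbers) that actually vary from the base.
        if slot % 2 or word is None or word == base.get(slot):
            continue
        gap = slot - 1
        if gap not in witness and any(o.get(gap) == word for o in others):
            adjusted[gap] = word    # 'black' now aligns against the gap
            adjusted[slot] = None   # and 'cat' is recorded as omitted
    return adjusted


base = {2: "the", 4: "cat", 6: "sat"}
witness1 = {2: "the", 4: "black", 6: "sat"}
witness2 = {2: "the", 3: "black", 4: "cat", 6: "sat"}

print(adjust_replacement(witness1, [witness2], base))
```

For witness1 this yields ‘black’ at slot 3 and nothing at slot 4, which is the second of the two stored representations shown above.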
Collate0-2 did NOT do this
variant adjustment. CollateXML should do it. In the next section, I consider
some possibilities.
The adjustment of alignment information: towards multiple progressive alignment
This matter of automated adjustment of variant information at the storage stage
— that is, after the collation of a particular word has finished — is one area
where the algorithms of Collate0-2 could be dramatically improved.
Consider, first, this case:
base the white and black cat
witness1 the black and white cat
Collate0-2 will record this
as a single piece of variant information: that the whole phrase ‘white and
black’ in the base has been replaced by the whole phrase ‘black and white’. It
has been pointed out to me, quite separately, by two very different groups of
scholars, that this is inadequate (the two groups are: the Münster institute,
and the department of Molecular Biology in Cambridge):
- This does not record that the words of the
variant text ‘black and white’ are actually the same as those of the base
- As a result: suppose that a second witness
has ‘green or blue’ for this phrase. To the program (and hence, to any
system based on it) the variant ‘black and white’ differs from the base text
‘white and black’ exactly as much as the variant ‘green or blue’ does. But this
loses a key piece of information: that the variant ‘black and white’ is actually
much closer to the base than is the variant ‘green or blue’.
CollateXML needs to find a
way of adjusting the variant store to show that in fact the variant ‘black and
white’ represents not one, but four pieces of information:
- firstly, that there is a phrase variant
(the existing Collate0-2 algorithms do this)
- secondly, that actually each word in the
phrase variant does agree with the base: a further three pieces of
information (Collate0-2 goes a little way towards this, but not far
enough)
Consider, further, this
case:
base the white and black cat
witness1 the black and white cat
witness2 the black and green cat
Here, we should show that
witness2 both has a phrase variant AND is a witness for the words ‘black’ ‘and’
— and, furthermore, has a variant ‘green’ on the word ‘white’ in both the base
and witness1. One wants an output as follows:
white and black ] black and white witness1; black and green witness2
white ] witness1; green witness2
and ] witness1 witness2
black ] witness1 witness2
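The word-level lines of this output could be derived from the phrase variants along the following lines. This is a simplified sketch: `word_support` is a hypothetical name, and it only records which witnesses contain each base word somewhere within the phrase. It does not yet pair ‘green’ with ‘white’, and it ignores repeated words.

```python
def word_support(base_phrase, witnesses):
    """For each base word, list the witnesses whose phrase variant contains it."""
    support = {word: [] for word in base_phrase.split()}
    for sigil, phrase in witnesses.items():
        for word in phrase.split():
            if word in support:
                support[word].append(sigil)
    return support


support = word_support(
    "white and black",
    {"witness1": "black and white", "witness2": "black and green"},
)
for word, sigils in support.items():
    print(word, "]", " ".join(sigils))
```

This prints ‘white’ with witness1 only, and ‘and’ and ‘black’ with both witnesses, matching the three word-level lemmata above apart from the ‘green’ variant.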
If we can figure out a way
to store this information then we are well on our way to collation nirvana:
multiple progressive alignment. But before we get to that place: we have to
understand parallel segmentation.
Variant information storage and parallel segmentation
Perhaps the single most important development in Collate2 was the support for
parallel segmentation. I write about this in the ‘Collation rationale’ on the
Miller’s Tale CD-ROM. The example I use there is
This Carpenter hadde wedded newe a wyf
This Carpenter hadde wedded a newe wyf
This Carpenter hadde newe wedded a wyf
This Carpenter hadde wedded newly a wyf
This Carpenter hadde E wedded newe a wyf
This Carpenter hadde newli wedded a wyf
This Carpenter hadde wedded a wyf
In that article I explained
that in the early versions of Collate, we used to collate this by what I called
‘base text collation’: that is we would compare each witness (54 in this case)
word by word with this one base, one witness at a time, and output the
variation so:
This ] 54 witnesses
Carpenter ] 54 witnesses
hadde ] 54 witnesses
wedded ] 53 witnesses; E wedded 1 witness
wedded newe ] newe wedded 1 witness, newli wedded 1 witness
newe ] 26 witnesses; newly 1 witness; omitted 1 witness
newe a ] a newe 23 witnesses
a ] 30 witnesses
wyf ] 54 witnesses
We see here that for the
first three words and the last word there is no variation, and we just state
accordingly that all witnesses there agree with the base and with each other.
All the variation occurs on the three base text words ‘wedded newe a’. This
variation is actually recorded against five lemmata: in turn ‘wedded’, ‘wedded
newe’, ‘newe’, ‘newe a’ and ‘a’. Observe that the phrases ‘wedded newe’ and
‘newe a’ overlap one another, and also overlap the three words ‘wedded’
‘newe’ ‘a’.
The ‘Collation rationale’
article goes on to explain why we became increasingly dissatisfied with this
method. One factor was that it highlighted the base text: by referring all
variation to this base text, it gave the base text a prominence which we did
not think appropriate. We thought of the base text as just a series of slots
on which we hung the collation: but this mode of expression seemed to give it
an authority beyond this. It is not that we do not believe in ‘edited’ texts:
just that this base text was not conceived, or intended to be, any such edited
text. But its prominence made it look as if it could be such an edited text.
A second factor was the
argument put to us by the evolutionary biologists: that where variant lemmata
overlap, as they do in the cases of the five variants on the three words ‘wedded
newe a’, one cannot compare directly the different witnesses. Here, we have one
set of variants on the phrase ‘wedded newe’ and a second on the phrase ‘newe
a’, as well as variants on each individual word. If manuscript A has a variant
on ‘wedded newe’ and B has one on ‘newe a’ there is no way one can compare the
text of A and B directly, and make any statement at all about the relationship
between A and B at those points.
This defect in base text
collation had other implications. We wanted to be able to point at any word in
any manuscript and say: what readings do the other manuscripts have at this
point? But this was exactly what our system could not do. With our system, we
could only say: at this word, the base text has such and such. We could not
always say: at this word, here are all the readings found at this point in all
the other texts. Similarly, we wanted to be able to compare any two (or more)
manuscripts word by word, showing exactly how they differ. Once more, this
system could not do that: we could only show how they severally differed from
the base text, not how they differed from each other.
The only cure for this we
could see was: eliminate overlapping variation. This meant that we should refer
all variants in all witnesses to the same base lemma. This meant that,
practically, the unit of variation has to be fixed by the longest variant
present at any point. In the case of the Miller’s Tale example: with base text
collation we have five sets of lemmata in the three word base sequence ‘wedded
newe a’, and so cannot compare the witnesses on any one of the lemmata with
those for any other. To eliminate all overlapping variation here we should have
one lemma and one lemma only: all three words of the base text here. All
variants on this one lemma are then directly in parallel with each other. The
whole text, across all the witnesses, is broken into parallel segments, with
text of any one witness at any one segment being directly comparable to the
text of any other witness at that segment: hence, the name ‘parallel
segmentation’.
This is the collation given
by the base text collation system, with five different lemmata:
wedded ] 53 witnesses; E wedded 1 witness
wedded newe ] newe wedded 1 witness, newli wedded 1 witness
newe ] 26 mss; newly 1 witness; omitted 1 witness
newe a ] a newe 23 witnesses
a ] 30 witnesses
Now, this is the collation
given by parallel segmentation, with just one lemma:
wedded newe a ] wedded newe a 25 witnesses
wedded a newe 23 witnesses
newe wedded a 1 witness
E wedded newe a 1 witness
wedded newly a 2 witnesses
newli wedded a 1 witness
wedded a 1 witness
How did Collate0-2 identify
parallel segments? Collate0-2 used a system of variant information storage
similar to that outlined above: essentially, creating a table which records in
numeric form exactly what words in each witness correspond with what words in the base.
It would update this table after each word collated in the base. Then, it would
inspect the table, and ask: is there a variant lemma open at this point? If
there were, then it would not output any apparatus, but move on to the next
word, and only when it found no variant lemmata open would it output all the
variants on the whole segment of text.
Thus, for the base text
sequence ‘wedded newe a’ it would proceed as follows. It would collate the
first word, ‘wedded’, and discover the following:
wedded ] 53 witnesses; E wedded 1 witness
wedded newe ] newe wedded 1 witness, newli wedded 1 witness
That is, the lemma ‘wedded
newe’ is still open after collation of ‘wedded’. So no apparatus is output, and
it goes onto the next word:
newe ] 26 mss; newly 1 witness; omitted 1 witness
newe a ] a newe 23 witnesses
Now, the lemma ‘wedded
newe’ has been closed. But another variant lemma ‘newe a’ is now open. So we
have to carry on to the next word:
a ] 30 witnesses
Now, at last: no phrase
variant is open. We can close the segment, and output all the variation found
on the whole phrase ‘wedded newe a’.
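The effect of ‘keep going while any lemma is open’ is the same as merging overlapping spans. Here is a minimal sketch, assuming each lemma is stored as a pair of base word positions (start and end, inclusive). The function name and this representation are assumptions; Collate0-2 worked incrementally from its table rather than on a complete list of lemmata.

```python
def parallel_segments(lemmata):
    """Merge overlapping lemma spans into non-overlapping segments."""
    segments = []
    for start, end in sorted(lemmata):
        if segments and start <= segments[-1][1]:
            # A lemma is still open at this word: extend the current segment.
            segments[-1][1] = max(segments[-1][1], end)
        else:
            # No lemma is open: the previous segment is closed, start a new one.
            segments.append([start, end])
    return [tuple(s) for s in segments]


# Base: This(1) Carpenter(2) hadde(3) wedded(4) newe(5) a(6) wyf(7)
lemmata = [(1, 1), (2, 2), (3, 3), (4, 4), (4, 5), (5, 5), (5, 6), (6, 6), (7, 7)]
print(parallel_segments(lemmata))
# [(1, 1), (2, 2), (3, 3), (4, 6), (7, 7)]
```

The five overlapping lemmata on words 4 to 6 collapse into the single segment ‘wedded newe a’.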
The limits of parallel segmentation: toward progressive multiple alignment
Parallel segmentation has served us well. It has allowed us to remove the base
text from the apparatus output completely: on our publications now, you do not
see the base text at all. We still use a base text when we collate, but its
function now is purely to identify the variants present at each point, and we
customarily optimize it for that purpose (for example, adding or rearranging
words to improve variant identification). The move to parallel segmentation has
other benefits. We can now identify at any point in any witness just what
witnesses are present at that point; we can compare any two (or more)
witnesses; we can create much richer analyses of stemmatic relations. But we
are still not happy.
In the ‘Collation
Rationale’ argument I cite the variants on the first four words of line 646 of
the Miller’s Tale (‘He was agast so of Nowelys flood’).
1. He was agast so 33 witnesses
2. He was agast 4 witnesses
3. So he was agast 6 witnesses
4. He was so agast 7 witnesses
5. He was agast and feerd 2 witnesses
6. So was he agast 1 witness
Just as a presentation of
the variation at this point, this is quite efficient. But as a representation
of the exact linkages between the witnesses, it is rather inefficient. These
six variants are presented in simple parallel, as if no two of them are any
closer than any other. But manifestly, that is not true. The second and fourth
readings ‘He was agast’ and ‘He was so agast’ are much closer to the first
reading ‘He was agast so’ than they are to either the third or the sixth reading.
In turn, the third and sixth readings ‘So he was agast’ and ‘So was he agast’
are much nearer each other than they are to the other readings.
With parallel segmentation,
once it has found the segments, the collation stops and just presents the
segments it has found. In this collation system, all variants at any point are
equally unlike. We require some system of grouping the variants within each
segment. For this example, I proposed that the six variants here should be
grouped into two variant sequences:
- 46 witnesses: made up of variants 1, 2, 4,
5, all beginning with the words ‘He was..’
- Seven witnesses: made up of variants 3 and
6, both beginning with ‘So’
We can break up the first
group still further:
- 40 witnesses, made up of variants 1 and 4,
having the same words but with ‘so agast’ transposed
- 6 witnesses, made up of variants 2 and 5,
both omitting ‘so’
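The witness counts in these groupings can be checked mechanically. In this sketch the two grouping keys (the opening word, then the presence of ‘so’) are chosen by hand to mirror the groups described above; finding such keys automatically is exactly the open problem. The function `group_by` is a hypothetical helper.

```python
variants = {
    1: ("He was agast so", 33),
    2: ("He was agast", 4),
    3: ("So he was agast", 6),
    4: ("He was so agast", 7),
    5: ("He was agast and feerd", 2),
    6: ("So was he agast", 1),
}


def group_by(variants, key):
    """Group variant numbers by key(text), summing the witness counts."""
    groups = {}
    for number, (text, count) in variants.items():
        ids, total = groups.get(key(text), ([], 0))
        groups[key(text)] = (ids + [number], total + count)
    return groups


top = group_by(variants, lambda text: text.split()[0])
print(top)  # {'He': ([1, 2, 4, 5], 46), 'So': ([3, 6], 7)}

he_group = {n: v for n, v in variants.items() if n in top["He"][0]}
sub = group_by(he_group, lambda text: "so" in text.lower().split())
print(sub)  # {True: ([1, 4], 40), False: ([2, 5], 6)}
```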
Finally, we note that the
two groups 1 and 2 (of 46 and 7) are linked together via variant 1 (from group
1) and variant 3 (from group 2): these differ only in their placement of the
word ‘so’. We can represent this schematically as follows:
From examination of the
variant map, we can see that — rather remarkably (or not!) — this representation
mirrors the textual history of the tradition. The original reading is likely to
have been variant 1 (33 mss). Three variants descended directly from variant 1:
- Variant 3 (six witnesses) by transposition of ‘so’, from which a further
variant (variant 6, one witness) develops, by transposition of ‘he was’
- Variant 4 (7 witnesses) by transposition of ‘agast so’
- The ancestor of variants 2 and 5: both omit the ‘so’, while 5 adds ‘and
feerd’.
Indeed, this distribution
is consistent with other groupings established by our analysis.
So, here is the challenge I
set in the ‘Collation Rationale’ article, here set out in more detail:
- Identify relationships between the variant
groups found by parallel segmentation
- Work out a way of storing the information
about these relationships, so as to enable different kinds of output
- Work out the best ways of expressing this
information, in some kind of hierarchical or layered form.
At present, we do not have
any means of formally expressing the relationships between the variant groups
found by parallel segmentation. Here is a draft of how it might be done, using
the example above:
He was agast so
He was so agast
He was agast
He was agast and feerd
So he was agast
So was he agast
We need an adaptation of
the system used by the TEI to hold this. Ideas please!
Variant identification
So far, we have aligned the texts, stored the alignment identification, and
then adjusted the alignment information (we hope, through some form of multiple
progressive alignment). But we have not yet identified any variants. Now,
consider again our example from above:
A The black cat
B THE BLACK CAT
C The black cat
D The, black cat
Following parallel
segmentation, we may now ignore the base. We look at the first word, and find
they are aligned as follows:
A The
B THE
C The
D The,
Are these, or are these
not, variants of each other? I propose that Collate3 have, for each witness, a
specifications object. This will state, for each witness, whether differences of
case, XML encoding, and punctuation are to be treated as variants or not.
Presume that we direct that case differences and XML encoding are not variants
but that punctuation is. We would get the following collation taking A as the
base
The ] A B C; The, D
Taking B as the base: the
variant would appear as
THE ] A B C; The, D
Or, if we say that
punctuation is not significant, but XML encoding is significant, we will get
this collation:
The ] A B D; The C
Variant identification and the return of the base
I said above we may now
discard the base. To a point, Lord Copper (esoteric joke, see Evelyn Waugh’s
Scoop). There is one critical operation for which we still must retain the base.
The question is to do with
the use of variant specifications to identify exactly what is a variant.
Suppose for our pair A and D we have the variants THE (A) and The, (D). We have the following variant specifications:
A: ignore case and punctuation
D: ignore case but do not ignore punctuation
We now compare A and D:
‘THE’ and ‘The,’. From the point of view of A: there is no variant here,
because we are ignoring both case and punctuation. But from the point of view
of D: there is a variant, because there is a difference of punctuation.
Thus: the variation found
changes, according to the point of view. It changes too (obviously) according
to which witnesses we are comparing. The only way I can see out of this is to
use the base as the measure against which variants are identified, but always
do the variant identification using the specifications set for the witness. In
this case, presume that the base here is ‘The’, with all witnesses set to
ignore case but not punctuation. We will then have:
The ] A B C; The, D
Notice that depending on
the base text and the collation specifications, we could get very different
results. Suppose that we set punctuation to be ignored in A B C but not D. If
we use ‘The’ as the base text, we get this:
The ] A B C; The, D
But if we set ‘The,’ as the
base, we get this:
The, ] A B C D
I don’t see any way around this.
One could avoid this (as Collate0-2 did) by insisting that all witnesses have
the same collation specifications. But it has been forcefully represented to me
that it would be very useful to be able to specify different treatments of
case/punctuation/xml for different witnesses. So we will
do this.
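The dependence on base and specifications can be seen in a small sketch. It assumes the readings A ‘The’, B ‘THE’, C ‘The’, D ‘The,’, with case ignored for every witness and punctuation ignored for A, B and C but not D. The function names and the dictionary form of the specifications are hypothetical, and XML encoding is left out.

```python
import string


def normalise(reading, ignore_case, ignore_punctuation):
    """Apply one witness's specification to a reading."""
    if ignore_punctuation:
        reading = reading.strip(string.punctuation)
    if ignore_case:
        reading = reading.lower()
    return reading


def variant_witnesses(base, readings, specs):
    """Return the sigils whose reading differs from the base under their own spec."""
    return [
        sigil
        for sigil, reading in readings.items()
        if normalise(base, **specs[sigil]) != normalise(reading, **specs[sigil])
    ]


readings = {"A": "The", "B": "THE", "C": "The", "D": "The,"}
specs = {s: {"ignore_case": True, "ignore_punctuation": s != "D"} for s in readings}

print(variant_witnesses("The", readings, specs))   # ['D']
print(variant_witnesses("The,", readings, specs))  # []
```

With ‘The’ as base only D varies; with ‘The,’ as base no witness does, which is the asymmetry described above.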
February 6, 2007
Datastructures for CollateXML
Filed
under: CollateXML datastructures — Peter @ 5:46 am
From the account of the
collation, we are dealing with something very different from ‘string comparison’.
Indeed, the base unit of the collation is the word: we collate words, not
strings. Words may be concatenated, or divided: but words are the basis of it
all. (This was the form used by Collate).
For each witness, we need
the following information:
- Its sigil
- Its location (in Collate0-2 this was simply a file name; in CollateXML it might be a URL, an XQuery or XPath expression, etc.)
- Collation specifications for this witness.
See below.
- For each collateable block: two
collateable object arrays. See below
- For each collateable block: an array of
correspondences with the base. See below.
The collation specifications for variant identification
These will control what is recorded as a variant against the base. Settings include:
a. case. Settings will be collate/ignore.
if collate: collation will treat differences of case as variants.
if ignore: collation will not treat differences of case as variants.
b. xml. Settings will be all/none/nominated.
if none: all xml encoding surrounding, within, or between the words will be ignored.
if all: all xml encoding will be collated, including empty elements, surrounding, within, and between the words.
if nominated: only specified xml elements will be collated. The details of the
xml elements to be collated will be held in a further structure (see below).
c. xmlcollate: null unless
xml=nominated. This structure is a series of elements to be collated, as
follows:
i. gi: the gi of the
element to be collated (including namespace)
ii. attributes: Values are all/none/nominated. If all: all attributes and their
values are to be collated; if none, all attribute values are ignored, and only
element names are collated; if nominated, details of attributes to be collated
are held in a further structure
iii. collateattributes: null unless attributes=nominated. This structure is a
series of attribute names which will be collated for this element (this could
be further elaborated, perhaps, to set conditions: report as variant if the
attribute is a particular value)
d. punctuation. Settings
will be all/none/nominated
if all: collate all
punctuation, as identified by the isPunctuation method
if none: collate no punctuation, as identified by the isPunctuation method
if nominated: collate only specific punctuation identified by the isPunctuation
method
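A minimal sketch of how settings a to d might be held in Java, assuming records and enums (Java 16+). Every name here is hypothetical, and an empty list stands in for ‘null’ where nothing is nominated. Nominated punctuation is left as a bare setting, since the text does not say how the nominated marks would be listed.

```java
import java.util.List;

// Illustrative only: all type and method names are assumptions.
class CollationSpecSketch {
    enum CaseSetting { COLLATE, IGNORE }
    enum Scope { ALL, NONE, NOMINATED }

    // One nominated element: its gi, plus how its attributes are treated.
    record ElementSpec(String gi, Scope attributes, List<String> collateAttributes) {}

    record Spec(CaseSetting caseSetting, Scope xml, List<ElementSpec> xmlCollate, Scope punctuation) {

        // Does an element with this gi take part in the collation?
        boolean collatesElement(String gi) {
            return switch (xml) {
                case ALL -> true;
                case NONE -> false;
                case NOMINATED -> xmlCollate.stream().anyMatch(e -> e.gi().equals(gi));
            };
        }

        // Does this attribute on this element take part in the collation?
        boolean collatesAttribute(String gi, String attribute) {
            if (xml == Scope.ALL) return true;
            if (xml == Scope.NONE) return false;
            return xmlCollate.stream()
                    .filter(e -> e.gi().equals(gi))
                    .anyMatch(e -> switch (e.attributes()) {
                        case ALL -> true;
                        case NONE -> false;
                        case NOMINATED -> e.collateAttributes().contains(attribute);
                    });
        }
    }

    // Example: ignore case and punctuation; collate only <hi>, and on it only 'rend'.
    static Spec example() {
        return new Spec(
                CaseSetting.IGNORE,
                Scope.NOMINATED,
                List.of(new ElementSpec("hi", Scope.NOMINATED, List.of("rend"))),
                Scope.NONE);
    }
}
```

With the example specification, a difference in the rend attribute of a hi element would be reported, while a difference in its ht attribute, or in any note element, would not.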
The specifications object
must also have at least one method: isPunctuation. For a particular pair of
strings, this should identify whether differences between them are purely
punctuation (in which case, they might or might not be variants) or not.
Two other methods might be required:
isCaseDifference: if it is
found that Java’s native methods for ignoring case difference when comparing
strings are not adequate.
adjustXML: for some contexts, we may need to do more than simply ignore/not
ignore XML.
Consider:
ex&per;perience
One might here wish to ignore the &per; entity and treat this as
‘experience’.
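The three methods might be sketched as follows. This is one possible reading, not a settled design: the bodies are assumptions, ‘punctuation’ here means only the ASCII punctuation matched by Java's \p{Punct} class, and adjustXML simply drops entity references, which is only one of the treatments one might want.

```java
// Illustrative only: method bodies are assumptions about what the
// specifications object might do.
class SpecMethodsSketch {
    // Remove ASCII punctuation; scribal marks held as entities are not covered.
    static String stripPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    // True when the two strings differ, and differ only in punctuation.
    static boolean isPunctuation(String a, String b) {
        return !a.equals(b) && stripPunctuation(a).equals(stripPunctuation(b));
    }

    // True when the two strings differ, and differ only in case.
    // Relies on String.equalsIgnoreCase; a custom comparison could replace it.
    static boolean isCaseDifference(String a, String b) {
        return !a.equals(b) && a.equalsIgnoreCase(b);
    }

    // Drop entity references such as &per; so the surrounding letters join up.
    // This must run before stripPunctuation, because '&' and ';' are themselves
    // punctuation and stripping them first would leave the entity name behind.
    static String adjustXML(String s) {
        return s.replaceAll("&[A-Za-z][A-Za-z0-9]*;", "");
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation("The", "The,"));     // true
        System.out.println(isCaseDifference("THE", "The"));   // true
        System.out.println(adjustXML("ex&per;perience"));     // experience
    }
}
```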
The collation specifications for text alignment
The model here proposed, of
separating text alignment from variant identification, presumes that optimal
text alignment would be achieved by ignoring differences of case, punctuation
and xml. Thus, at the alignment stage, we would use the minimal set of
collation specifications for comparison of witnesses with the base.
Hierarchical setting of collation specifications
One would expect that for most collations, one would have identical
specifications for all witnesses. In programming terms: one would set the
specifications for the class of witnesses, which would then inherit a uniform
set of specifications. This design permits that the uniform specification would
be overruled for specific witnesses.
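One simple way to get this behaviour, sketched under assumptions: rather than class inheritance, the sketch keeps one default specification and a table of per-witness overrides, and falls back to the default when a witness has no entry. The names are hypothetical and the specification is reduced to two flags.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a default specification with per-witness overrides.
class SpecRegistrySketch {
    // Reduced specification, just enough to show the fallback.
    record Spec(boolean ignoreCase, boolean ignorePunctuation) {}

    private final Spec defaultSpec;
    private final Map<String, Spec> overrides = new HashMap<>();

    SpecRegistrySketch(Spec defaultSpec) {
        this.defaultSpec = defaultSpec;
    }

    // Overrule the uniform specification for one witness, identified by sigil.
    void override(String sigil, Spec spec) {
        overrides.put(sigil, spec);
    }

    // The specification in force for a witness: its own, else the default.
    Spec specFor(String sigil) {
        return overrides.getOrDefault(sigil, defaultSpec);
    }

    public static void main(String[] args) {
        SpecRegistrySketch registry = new SpecRegistrySketch(new Spec(true, true));
        registry.override("D", new Spec(true, false));
        System.out.println(registry.specFor("A")); // the default
        System.out.println(registry.specFor("D")); // the override
    }
}
```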
The collateable object arrays
The key to Collate0-2 was
that it did not collate text strings: it collated word objects. For each
witness, it held the words of the text in an array of word objects, numbered
from 0 to xxx, and all collation took place against these word objects, with
information about variants found stored in tables of numbers referring to these
arrays. I propose that CollateXML retain, refine and extend this model.
Collate0-2 accepted ‘plain
text’ and converted this to word object arrays as it collated. As it did so, it
might remove (depending on various settings) punctuation or other characters
from the text to be collated. Thus ‘April / that’ would become:
word 1: April
word 2: that
Notice that the ‘/’ is here
removed. At a later point, Collate0-2 converted the text to
<w n="1">April</w> / <w n="2">that</w>
This is rather
unsatisfactory. The relationship between the numbering of the words in the word
object array and that in the converted XML depends on rather fragile
assumptions about what is and is not a word. I propose instead that CollateXML
recommend that for word-by-word collation, input must be in full XML form, with
all discrete elements marked as follows:
<w n="1">April</w> <w n="2">/</w> <w n="3">that</w>
This has several
implications. It means that, because of the problem of overlapping hierarchies,
treatment of elements spanning across words has to be as follows:
<w n="1"><hi>April</hi></w> <w n="2"><hi>/</hi></w> <w n="3"><hi>th</hi>at</w>
not
<hi><w n="1">April</w> <w n="2">/</w> <hi><w n="3">th</hi>at</w>
The advantage of the
explicit labelling of every collateable object in the original text as a
<w> element with an ‘n’ attribute is that it makes linking of the
collation with the original text absolutely explicit. The ‘n’ attribute on each
<w> element can be used to denote each word in the collateable object array,
and then used to link to the corresponding <w> element in the original. (One
might — might — use xPath to achieve the same result: that is a matter for
discussion.)
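A rough sketch of reading such input into an array keyed by the ‘n’ attribute. It is deliberately naive: a regular expression stands in for a real XML parser, it assumes <w> elements are never nested and carry only an ‘n’ attribute, and the names are hypothetical. Markup inside the word is kept as part of the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: regex extraction of <w n="..."> elements.
class WordExtractSketch {
    // n: the number on the <w> element; content: everything inside it.
    record Word(int n, String content) {}

    // Matches <w n="12">...</w>; the content may hold inline markup such as <hi>.
    private static final Pattern W =
            Pattern.compile("<w\\s+n=\"(\\d+)\">(.*?)</w>", Pattern.DOTALL);

    static List<Word> words(String xml) {
        List<Word> out = new ArrayList<>();
        Matcher m = W.matcher(xml);
        while (m.find()) {
            out.add(new Word(Integer.parseInt(m.group(1)), m.group(2)));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "<w n=\"1\"><hi>April</hi></w> <w n=\"2\"><hi>/</hi></w> "
                + "<w n=\"3\"><hi>th</hi>at</w>";
        for (Word w : words(line)) {
            System.out.println(w.n() + ": " + w.content());
        }
    }
}
```

Because each array entry carries its own ‘n’, any variant recorded against entry 3 can be traced straight back to the <w> element numbered 3 in the original.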
I said we need TWO
collateable object arrays for each witness. The first array, as specified
above, is to hold the original text: call this textOriginal. But in fact, this
is not the text which will be actually collated. The second array is the text
which will be actually collated: call this textCollateable.
TextCollateable
will have identical structure and initially identical content to textOriginal.
The reason for the two
arrays is to make regularization possible. Regularization was one of the great
strengths of Collate0-2, and the approach here suggested is based closely on
how Collate0-2 worked. As the scholar collates, he or she will see cases where
it is necessary to filter out spelling or other non-significant variation. This
may involve alteration of word division. Thus, we might be collating:
base: the man Cat
wit1: theman cat
It appears that in wit1 one
will want to change the word division for ‘theman’ and regularize ‘cat’ to
‘Cat’. Thus, textOriginal would hold for wit1:
word1 theman
word2 cat
while textCollateable must
be altered to:
word1 the
word2 man
word3 Cat
Notice that this will mean
keeping an offset pointer at each word, indicating for each array what is the
corresponding word in the other array.
Putting this together, we
require the following information for each word object in each collateable
object array:
- the word itself (including XML encoding)
- the n number for the word, to relate to
the n number on the corresponding <w> element in the original
- the offset to the corresponding word in
the other array. Thus: for word 1 in textCollateable the offset would be 0;
for word 2 and word 3 it would be -1. For word 1 in textOriginal the
offset would be 0; for word 2 it would be +1.
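The ‘theman cat’ example can be written out as two small arrays to check that the offsets behave as described. This is a sketch with hypothetical names. The arrays are 0-based, so word 1 sits at index 0. Giving both ‘the’ and ‘man’ the n number 1 is my assumption about how a divided word would point back to its original <w> element.

```java
import java.util.List;

// Illustrative only: the two collateable object arrays for wit1.
class WordArraysSketch {
    // text: the word (with any XML encoding); n: number of the <w> element in
    // the original; offset: distance to the counterpart in the other array.
    record WordObject(String text, int n, int offset) {}

    static final List<WordObject> TEXT_ORIGINAL = List.of(
            new WordObject("theman", 1, 0),
            new WordObject("cat", 2, +1));

    static final List<WordObject> TEXT_COLLATEABLE = List.of(
            new WordObject("the", 1, 0),
            new WordObject("man", 1, -1),   // n=1 is an assumption, see lead-in
            new WordObject("Cat", 2, -1));

    // Follow the offset from index i in one array to its counterpart in the other.
    static WordObject counterpart(List<WordObject> from, List<WordObject> to, int i) {
        return to.get(i + from.get(i).offset());
    }

    public static void main(String[] args) {
        // 'man' and 'the' both lead back to 'theman'; 'Cat' leads back to 'cat'.
        for (int i = 0; i < TEXT_COLLATEABLE.size(); i++) {
            System.out.println(TEXT_COLLATEABLE.get(i).text() + " <- "
                    + counterpart(TEXT_COLLATEABLE, TEXT_ORIGINAL, i).text());
        }
    }
}
```

Note that the offset from textOriginal can only point at one word of a divided pair (here ‘theman’ points at ‘the’), so finding every regularized word for an original word means scanning forward from that point.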
June 21, 2007
Goodbye CollateXML, hello CollateX
Filed
under: Introduction — Peter @ 1:32 pm
At last, we are moving.
Today I began setting up the SourceForge site that will take all the code as
we start work on the program. And in the process, I did what I have been
planning to do for some time: change the name of the program from CollateXML to
CollateX. Those who have read all the postings on this (of course, all of you)
will know the reason for this change. We plan that the program should be able
to collate texts in any format whatever, by devising a single canonical input
form and then having translators into this canonical form. Thus it will be able
to collate XML, sure: but it will also be able to collate many other formats,
including indeed old-style Collate 1-3 files.
June 28, 2007
Collate examples, and first task
Filed
under: Collate — Peter @ 10:02 pm
We have decided that the
logical place to start is by definition of data structures for the common input
phase. So Andrew will get on today with working these out. Here are a few
example sets for him to chew on:
1:
Base the black cat
A The black cat
B THE BLACK CAT
C The black cat
D The, black cat
2:
Base the white and black cat
A The black cat
B the black and white cat
C the black and green cat
3:
This Carpenter hadde wedded newe a wyf
This Carpenter hadde wedded a newe wyf
This Carpenter hadde newe wedded a wyf
This Carpenter hadde wedded newly a wyf
This Carpenter hadde E wedded newe a wyf
This Carpenter hadde newli wedded a wyf
This Carpenter hadde wedded a wyf
4.
- He was agast so: 33 witnesses
- He was agast: 4 witnesses
- So he was agast: 6 witnesses
- He was so agast: 7 witnesses
- He was agast and feerd: 2 witnesses
- So was he agast: 1 witness
5. Time for some XML:
<l id="MI-35-El" n="35">&paraph; <w n="1">This</w> <w n="2">Carpente&rtail;</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l>
<l id="MI-35-Ii" n="35"><w n="1">This</w> <w n="2">Carpenter</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">a</w> <w n="6">newe</w> <w n="7">wi&ftail;</w> </l>
<l id="MI-35-Cn" n="35"><w n="1">This</w> <w n="2">Carpenter</w> <w n="3">had</w> <w n="4">newe</w> <w n="5">wedde&dtail;</w> <w n="6">awif</w> </l>
<l id="MI-35-Cp" n="35"><w n="1">This</w> <w n="2">Carpunter</w> <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">a</w> <w n="6">newe</w> <w n="7">wy&ftail;</w> </l>
<l id="MI-35-Hg" n="35">&paraph; <w n="1">This</w> <w n="2">Carpenter</w> &virgule; <w n="3">hadde</w> <w n="4">wedded</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l>
<l id="MI-35-Gg" n="35"><w n="1">This</w> <w n="2">carpenter</w> <w n="3">hadde</w> <w n="4">weddid</w> <w n="5">newe</w> <w n="6">a</w> <w n="7">wyf</w> </l>
6. Some more XML, this time with more encoding
<l id="MI-1-Bo1" n="1"><w n="1"><hi rend="unex" ht="3">w</hi><hi rend="ul">Hilom</hi></w> <w n="2">ther</w> <w n="3">was</w> <w n="4">duelling</w> <w n="5">in</w> <w n="6">Oxenford</w> <note
<l id="MI-1-Cx1" n="1"><w n="1"><hi ht="2" rend="other">W</hi>Hilom</w> <w n="2">therwas</w> <w n="3">dwellyn&gtail;</w> <w n="4">in</w> <w n="5">Oxenforde</w> </l>
<l id="MI-1-Bw" n="1"><w n="1"><hi ht="5" rend="orncp">W</hi>ylom</w> <w n="2">þer</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">in</w> <w n="6">Oxenford</w> </l>
<l id="MI-1-Ch" n="1"><w n="1"><hi ht="2" rend="orncp">W</hi>hilom</w> <w n="2">ther</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">at</w> <w n="6">Oxenforde</w> </l>
<l id="MI-1-Dd" n="1"><w n="1"><hi ht="5" rend="orncp">W</hi>hilom</w> <w n="2">there</w> <w n="3">was</w> <w n="4">dwellyng</w> &virgule; <w n="5">in</w> <w n="6">Oxenfor&dtail;</w> </l>
<l id="MI-80-Bo2" n="80"><w n="1">As</w> <w n="2">brode</w> <w n="3">as</w> <w n="4">is</w> <w n="5">þe</w> <w n="6">boos</w> <w n="7">of</w> <w n="8">a</w> <w n="9">bokelyr</w> </l>
<l id="MI-1-Ad3" n="1"><w n="1"><hi ht="4" rend="orncp">W</hi>hilom</w> <w n="2">ther</w> <w n="3">was</w> <w n="4">dwellyng</w> <w n="5">in</w> <w n="6">Oxenford</w> </l>
<l id="MI-1-Cp" n="1"><w n="1"><hi ht="6" rend="orncp">W</hi>hilom</w> <w n="2">þer</w> <w n="3">was</w> <w n="4">dwellyn&gtail;</w> <w n="5">at</w> <w n="6">Oxenfoor&dtail;</w> </l>
enough, now!
Email announcing these posts
Sent 6 February 2007 to Joris, Fotis, and other participants in the January 2007 The Hague meeting:
Dear everyone
at the meeting a week last Friday, we had a good deal of discussion about the future of Collate. Since the meeting, very much under the influence of Barbara, I have undergone a massive conversion.
Barbara pointed out that at the meeting, I seemed to be resisting the idea of handing over Collate for others to develop. Indeed I was, and as Barbara pointed out: I was doing exactly what I forever denounce other people for doing: saying 'this is mine! hands off!!'.
So now, I have seen the light. And indeed, the more I think of it: this is an ideal project for us all to collaborate on. And I was really impressed with the enthusiasm in The Hague for doing this together. So here is my suggestion: we develop the next Collate (which I suggest should be called CollateXML) together. To start off the process, I have created a blog
Here you will find a whole series of materials now about Collate, thus:
An introduction
A history of the three earlier versions of Collate
A design outline for CollateXML
How CollateXML should work
Some Datastructures needed for Collate
There will be more to come, but this will get us running! I will put up the code for Collate2, some of it, though I doubt this will be as useful as the explanations given on the blog site.
I'd be glad to hand this all over to you folks. I'm sure happy to help out, and provide test and sample files, etc etc, and offer lots more advice where I felt it might help. I'd say we could set this up as a SourceForge project and, well, get on with it. One place to start would be with implementing the fundamental word by word collation algorithm set out in 'How CollateXML should work'.
Well, I've thrown out the stone into the pond now. So folks, who is ready to give up a few years of their life getting this to run???
all the best
Peter