Consider:
LOON: RED EYES, FEATHERS, 28 VERTEBRAE
DOG:  HAIR, 23 VERTEBRAE, BROWN EYES
CROC: 28 VERTEBRAE, GREEN EYES, SCALES

We would readily construct the matrix:

LOON: 000
DOG:  111
CROC: 220

But with DNA data matters are more complicated because each possible character has the same 4 possible states (A, C, G, T).
Thus,
LOON: ACTTCCGAATTTGGCT
DOG:  ACTCGATTGCCT

does not immediately indicate how the states in DOG should be contextually homologized with the states in LOON.
By way of example, in the seminar two different assessments of contextual homology were offered by participants:
ACTTCCGAATTTGG-CT
|||  |||  |||  ||
ACT--CGA--TTG-CCT

ACTTCCGAATTTGGCT
|||*   *||| |*||
ACTC---GATT-GCCT

Which one is correct?
When asked how they arrived at their by-eye alignments, the first participant replied that it was done in a manner that would minimize base substitutions; the second indicated that it was done with an eye to minimizing insertions/deletions (indels).
But in fact, the second one does not minimize indels; this one does:

ACTTCCGAATTTGGCT
|||*    **|||*||
ACTC----GATTGCCT

So the second of the two alignments really attempts to balance the number of indels against the number of base substitutions.
The problems with by-eye alignment are
LOON: ACTTCCGAATTTGGCT
DOG:  ACTCGATTGCCT

and an assessment of the cost for the two alignments that follow:
LOON: ACTTCCGAATTTG-GCT
      |||  ||| || |  ||
DOG:  ACT--CGA-TT-GC-CT
1 subst. costs 1 | 1 gap costs 1 | Final Cost: 0(1)+5(1) = 5

LOON: ACTTCCGAATTTGGCT
      |||*   *||| |*||
DOG:  ACTC---GATT-GCCT
1 subst. costs 1 | 1 gap costs 1 | Final Cost: 3(1)+2(1) = 5
So, you see that if gaps cost the same as base substitutions, all of the disagreements between two sequences can be explained by insertions and deletions (cf. the first alignment) at no greater cost than an alignment that invokes substitutions. Unfortunately, though, this comes at the expense of all base substitutions, and thus at the expense of any phylogenetic information.
Usually you will want to set the cost of an indel (gap) to be higher than the cost of a substitution:
LOON: ACTTCCGAATTTG-GCT
      |||  ||| || |  ||
DOG:  ACT--CGA-TT-GC-CT
1 subst. costs 1 | 1 gap costs 2 | Final Cost: 0(1)+5(2) = 10

LOON: ACTTCCGAATTTGGCT
      |||*   *||| |*||
DOG:  ACTC---GATT-GCCT
1 subst. costs 1 | 1 gap costs 2 | Final Cost: 3(1)+2(2) = 7
Let's expand this to also consider the alignment that minimizes indels:
LOON: ACTTCCGAATTTG-GCT
      |||  ||| || |  ||
DOG:  ACT--CGA-TT-GC-CT
1 subst. costs 1 | 1 gap costs 2 | Final Cost: 0(1)+5(2) = 10

LOON: ACTTCCGAATTTGGCT
      |||*   *||| |*||
DOG:  ACTC---GATT-GCCT
1 subst. costs 1 | 1 gap costs 2 | Final Cost: 3(1)+2(2) = 7

LOON: ACTTCCGAATTTGGCT
      |||*    **|||*||
DOG:  ACTC----GATTGCCT
1 subst. costs 1 | 1 gap costs 2 | Final Cost: 4(1)+1(2) = 6
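The cost tallies above can be reproduced with a few lines of Python. This is only a sketch: the function name `alignment_cost` is made up for illustration, and it assumes (as the tallies imply) that a run of adjacent gap positions in one sequence is charged as a single gap.

```python
def alignment_cost(top, bottom, subst_cost, gap_cost):
    """Return (substitutions, gaps, total cost) for a pairwise alignment.

    Assumption: each unbroken run of '-' in one row counts as ONE gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    substitutions = 0
    gaps = 0
    previous = None  # row that held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("column with gaps in both rows")
        current = "top" if a == "-" else "bottom" if b == "-" else None
        if current is not None and current != previous:
            gaps += 1            # a new gap run opens in this column
        elif current is None and a != b:
            substitutions += 1   # two different bases face each other
        previous = current
    return substitutions, gaps, substitutions * subst_cost + gaps * gap_cost


# The three LOON/DOG alignments discussed above.
ALIGNMENTS = [
    ("ACTTCCGAATTTG-GCT", "ACT--CGA-TT-GC-CT"),
    ("ACTTCCGAATTTGGCT", "ACTC---GATT-GCCT"),
    ("ACTTCCGAATTTGGCT", "ACTC----GATTGCCT"),
]

for loon, dog in ALIGNMENTS:
    print(alignment_cost(loon, dog, subst_cost=1, gap_cost=2))
```

With a gap cost of 2 the second and third alignments beat the all-gaps alignment; with a gap cost of 1 all three tie at 5, which is the point made above about equal costs.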

So, that solves the length problem, but then, of course, if you were to do this for, oh, say, 75 taxa, you'd have to envisage 75-dimensional space!!!
Forget it....
Rather than simultaneously aligning all taxa, we have to do it step-wise following an order of alignment.
Some people suggest that one should align sequences of closely related taxa first (see esp. Mindell, D. 1991. Aligning DNA sequences: homology and phylogenetic weighting. In M. J. Miyamoto and J. Cracraft, eds. Phylogenetic Analysis of DNA Sequences. Oxford University Press, New York. pp. 73-89). But, obviously, one's preconceived notions of phylogeny, which direct the order of alignment, will then be self-fulfilling prophecies should those taxa group together in resulting phylogenetic analyses (duh).
One method would be to simply align them in the order in which they appear in the file of unaligned sequences, as is done in PILEUP, but this is not likely to be terribly efficient at getting the best alignment, and may even cause your phylogenetic tree to be biased by alphabetical order.
Obviously, one could align them in order of decreasing pairwise similarity. In this case, as with CLUSTAL, a UPGMA tree based on pairwise alignments determines the alignment order. Of course, the pairwise similarities could be converted into a distance tree using Fitch-Margoliash, Jukes-Cantor or another such measure to determine the alignment order (more recent versions of CLUSTAL and TreeAlign do this, for example). Accordingly, it shouldn't be surprising that MALIGN, for example, determines the order of alignment using the Wagner algorithm and a parsimony algorithm (with swapping, etc.).
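To make the idea of a similarity-driven alignment order concrete, here is a rough sketch of UPGMA clustering that returns a nested join order. It is not how CLUSTAL itself is implemented. The function names, the four toy sequences W to Z, and the use of plain unit-cost edit distance as a stand-in for a pairwise alignment score are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(s, t):
    """Unit-cost edit distance (an assumed stand-in for a pairwise score)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


def upgma_order(seqs):
    """Return a nested tuple giving the order in which to align the taxa."""
    clusters = {name: 1 for name in seqs}  # cluster -> number of members
    dist = {frozenset(p): edit_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    while len(clusters) > 1:
        # Join the closest pair of clusters (ties go to the first pair listed).
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        na, nb = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # UPGMA: size-weighted mean of the distances to the two parts.
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))


# Hypothetical toy sequences, not real data.
TOY = {"W": "ACGTACGT", "X": "ACGTACGA", "Y": "TTGTCCGT", "Z": "TTGTCCGA"}
print(upgma_order(TOY))
```

For the toy data the order comes out as ((W X)(Y Z)): the two most similar pairs are aligned first, then the two pairs are joined.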
It could be argued that it doesn't make a whole lot of sense to determine alignment order with one optimality criterion (e.g., phenetics) and then analyse the alignment later with another (e.g., parsimony). It could also be argued that it would be interesting to examine the differences these might make. That is, if a most-parsimonious tree and a UPGMA tree based on the same phenetic alignment agree more than a most-parsimonious tree and a UPGMA tree based on a parsimonious alignment, one might conclude that the phenetic alignment was unduly affecting the parsimony procedure.
LOON: AAC
DOG:  ACA
CROC: CCA
RAT:  CAC

There is one difference (two states) in each of the columns; thus the column-score for the alignment is 3.
However, it is possible to interpret the alignment in a transformational context (that is, in terms of what is possible given that they cannot all be each other's closest relatives). There's no reason why this couldn't be done in a likelihood framework, but it has not yet been. In a parsimony framework, it would be impossible to get only 3 steps on any given tree. Rather, the minimum cost over all possible trees, or the cladogram-score, is 4.
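The difference between the two scores can be checked by brute force for the four-taxon example. In the sketch below, the function names are invented for illustration, the column-score is read as "number of distinct states minus one, summed over columns" (my reading of the text), and the steps on each tree are found by simply trying every pair of states at the two internal nodes.

```python
from itertools import product

ALIGNMENT = {"LOON": "AAC", "DOG": "ACA", "CROC": "CCA", "RAT": "CAC"}

# The three unrooted topologies for four taxa, written as ((a, b), (c, d)).
TOPOLOGIES = [
    (("LOON", "DOG"), ("CROC", "RAT")),
    (("LOON", "CROC"), ("DOG", "RAT")),
    (("LOON", "RAT"), ("DOG", "CROC")),
]


def column_score(alignment):
    """Count (distinct states - 1) in every column, ignoring any tree."""
    return sum(len(set(col)) - 1 for col in zip(*alignment.values()))


def steps_on_topology(topology, alignment):
    """Fewest changes needed on one four-taxon tree, found by brute force."""
    (a, b), (c, d) = topology
    total = 0
    for i in range(len(alignment[a])):
        sa, sb, sc, sd = (alignment[t][i] for t in (a, b, c, d))
        # Try every pair of states for the two internal nodes u and v and
        # count the changes along the five branches of the tree.
        total += min(
            (sa != u) + (sb != u) + (u != v) + (sc != v) + (sd != v)
            for u, v in product("ACGT", repeat=2)
        )
    return total


def cladogram_score(alignment):
    """Best (lowest) number of steps over all three topologies."""
    return min(steps_on_topology(t, alignment) for t in TOPOLOGIES)


print(column_score(ALIGNMENT))                               # 3
print([steps_on_topology(t, ALIGNMENT) for t in TOPOLOGIES])  # [5, 6, 4]
print(cladogram_score(ALIGNMENT))                            # 4
```

No tree gets down to 3 steps because the first column supports a different grouping from the other two, so at least one column must change twice on any tree.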
To clarify this a little more, an alignment such as
I   ACCGTTGGA
II  AC-GTCTGA
III AC-GTC-AG
IV  AC-GTT-AG

might be arrived at by way of a Wagner-algorithm step-wise addition of taxa in this order: (I with II), then (III with IV), then these two sets connected. In summary, the alignment order is

((I II)(III IV)).

This then needs to be evaluated for its cost. There are three trees for the four taxa that the alignment can be evaluated on. Considering the gaps to be missing (which is not necessary), the various trees require 5, 6 or 7 steps. Thus the best score for this alignment is 5 (more complicated datasets would require cost evaluation by swapping on trees, of course, as opposed to evaluating all possible topologies as we have here).
We can then swap on the alignment; that is, instead of the alignment-topology ((I II)(III IV)), we swap I with III and get the alignment order

((III II)(I IV)),

which, suppose, gives us this slightly different alignment:

I   ACCGTTGGA
II  AC-GTCTGA
III AC-GTCAG-
IV  AC-GTTAG-

Again, assessing the cost of this alignment on the various tree topologies gives 4, 4, or 3 as the number of steps; thus the best score for this alignment is 3.
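The step counts quoted for the two alignments can be verified with a short script. The notes do not say how the steps were counted; the sketch below assumes Fitch's counting procedure and treats a gap as missing data (compatible with any base). The names `fitch`, `tree_steps` and `scores` are illustrative only.

```python
STATES = frozenset("ACGT")


def fitch(tree, column):
    """Return (possible_states, steps) for one column on a nested-tuple tree."""
    if isinstance(tree, str):
        state = column[tree]
        # Assumption: a gap is missing data, so any base is acceptable.
        return (STATES if state == "-" else frozenset(state)), 0
    left, lsteps = fitch(tree[0], column)
    right, rsteps = fitch(tree[1], column)
    shared = left & right
    if shared:                       # the two sides agree: no extra step
        return shared, lsteps + rsteps
    return left | right, lsteps + rsteps + 1


def tree_steps(tree, alignment):
    """Total steps over all columns of the alignment on one tree."""
    names = list(alignment)
    return sum(fitch(tree, dict(zip(names, col)))[1]
               for col in zip(*alignment.values()))


# The three possible trees for taxa I-IV (rooted arbitrarily for counting).
TREES = [
    (("I", "II"), ("III", "IV")),
    (("I", "III"), ("II", "IV")),
    (("I", "IV"), ("II", "III")),
]

FIRST = {"I": "ACCGTTGGA", "II": "AC-GTCTGA",
         "III": "AC-GTC-AG", "IV": "AC-GTT-AG"}
SECOND = {"I": "ACCGTTGGA", "II": "AC-GTCTGA",
          "III": "AC-GTCAG-", "IV": "AC-GTTAG-"}


def scores(alignment):
    """Steps required on each of the three trees, in the order of TREES."""
    return [tree_steps(t, alignment) for t in TREES]


print(scores(FIRST), min(scores(FIRST)))     # [5, 7, 6] 5
print(scores(SECOND), min(scores(SECOND)))   # [4, 4, 3] 3
```

Note that the best tree differs between the two alignments: the first alignment is cheapest on ((I II)(III IV)), the second on ((I IV)(II III)), which is exactly why the order of alignment matters for the downstream analysis.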
A summary of just a few of the various parameters that you need to think about with alignment includes: