Multiple Alignment


With morphological data it is easy to determine which states in one taxon go with which states in another taxon because generally, the available states are defined by the context of the character.

Consider:

LOON: RED EYES, FEATHERS, 28 VERTEBRAE
DOG:  HAIR, 23 VERTEBRAE, BROWN EYES
CROC: 28 VERTEBRAE, GREEN EYES, SCALES
We would readily construct the matrix:
LOON: 000
DOG:  111
CROC: 220
But with DNA data matters are more complicated because each possible character has the same 4 possible states (A, C, G, T).

Thus,

 LOON: ACTTCCGAATTTGGCT
  DOG: ACTCGATTGCCT
does not immediately indicate how the states in DOG should be contextually homologized with the states in LOON.

By way of example, in the seminar two different assessments of contextual homology were offered by participants:

ACTTCCGAATTTGG-CT
|||  |||  |||  ||
ACT--CGA--TTG-CCT

ACTTCCGGAATTTGGCT
|||*    *||| |*||
ACTC----GATT-GCCT
Which one is correct?

When asked how they arrived at this by-eye alignment, the first replied that it was done in a manner that would minimize base-substitutions, the second indicated that it was done with an eye to minimizing insertions/deletions (INDELS).

But in fact, the second one does not minimize indels, this does:

ACTTCCGGAATTTGGCT
|||*     **|||*||
ACTC-----GATTGCCT
So in fact, the second of the two alignments really attempts to balance the amount of indels with the amount of base substitution.

The problems with by-eye alignment are

  1. different alignments can lead to different phylogentic trees and yet these differences are merely a matter of opinion, not empirical science
  2. it is hard if not impossible to align by-eye without allowing one's pre-conceived notions of relationships to come into play
Thus, what is needed is some objective function, and method of aligning sequences which is repeatable and which is logical. It will surprise no one that just like the by-eye method, algorithmic multiple alignment is based on the trade off between the cost of a base substitution and the cost of an indel.


Algorithmic Alignment

Consider now,
  LOON: ACTTCCGAATTTGGCT
   DOG: ACTCGATTGCCT
and an assessment of the cost for the two the alignments that follow
alignment1 subst. costs1 gap costsFinal Cost
  LOON: ACTTCCGAATTTG-GCT
        |||  ||| || |  ||
   DOG: ACT--CGA-TT-GC-CT
110(1)+5(1) = 5
  LOON: ACTTCCGAATTTGGCT
        |||*   *||| |*||
   DOG: ACTC---GATT-GCCT
113(1)+2(1) = 5

So, you see that if gaps cost the same as base substitutions, all of the disagreements between two sequences can be explained by insertions and deletions (c.f., the first alignment). Unfortunately, though, this comes at the expense of all base substitutions, and thus at the expense of any phylogenetic information.

Usually you will want to set the cost of an indel (gap) to be higher than the cost of a substitution:
alignment1 subst. costs1 gap costsFinal Cost
  LOON: ACTTCCGAATTTG-GCT
        |||  ||| || |  ||
   DOG: ACT--CGA-TT-GC-CT
120(1)+5(2) = 10
  LOON: ACTTCCGAATTTGGCT
        |||*   *||| |*||
   DOG: ACTC---GATT-GCCT
123(1)+2(2) = 7
and here, now, the second of the two alignments is preferred because it has the lower cost.

Let's expand this to consider also the alignment that minimizes indels
alignment1 subst. costs1 gap costsFinal Cost
  LOON: ACTTCCGAATTTG-GCT
        |||  ||| || |  ||
   DOG: ACT--CGA-TT-GC-CT
120(1)+5(2) = 10
  LOON: ACTTCCGAATTTGGCT
        |||*   *||| |*||
   DOG: ACTC---GATT-GCCT
123(1)+2(2) = 7
  LOON: ACTTCCGAATTTGGCT
        |||*    **|||*||
   DOG: ACTC----GATTGCCT
124(1)+1(2) = 6
obviously, the last alignment is actually preferred.


The Size Problem

Above, we have considered only short sequences and only two taxa. This is fine so far as it goes because we can readily visualize different possible alignments. In reality this is not so easy. In the first place, sequences are long and one needs an efficient way to discover where indels are.
For two taxa this can be visualized as a 2 dimensional matrix with one taxon's sequence running across the top and the other's running down the left side. Where bases match, there's a correlation indicated in the appropriate cartesian coordinates, where they do not, there isn't.


The red circles above delimit indels.

So, that solves the length problem, but then, of course, if you were to do this for, oh, say, 75 taxa, you'd have to envisage 75 dimensional space!!!

Forget it....

Rather than simultaneously aligning all taxa, we have to do it step-wise following an order of alignment.


Order of Alignment

Sequences that are aligned to each other first are more likely to group closer to each other in resulting phylogenies simply as an artefact of the order of alignment.

Some people suggest that one should align sequences of closely related taxa first (see esp., Mindell, D. 1991. Aligning DNA sequences: homology and phylogenetic weighting. in M. J. Miyamoto and J. Cracraft, eds. Phylogenetic Analysis of DNA Sequences. Oxford University Press, New York. pp. 73-89). But, obviously, then, one's preconceived notions of phylogeny, which direct the order of alignment, will then be self-fulfilling prophesies should those taxa group together in resulting phylogenetic analyses (duh).

One method would be to simply align them in the order that they appear in the unaligned sequence-containing file as is done in PILEUP, but then this is not likely to be terribly efficient at getting the best alignment, and may even cause your phylogenetic tree to be biased by alphabetical order.

Obviously, one could align them in the order of decreasing pairwise similarity. In this case, as with CLUSTAL, a UPGMA tree based on pairwise alignments determines the alignment order. Of course, the pairwise similarities could be modified into a distance tree using a Fitch-Margoliash, Jukes-Cantor or other such measure to determine the alignment order (more recent versions of CLUSTAL and TreeAlign use this for example. Accordingly it shouldn't be surprising that MALIGN, for example, determines the order of alignment using the wagner algorithm and a parsimonious algorithm (with swapping etc). .

It could be argued that it doesn't make a whole lot of sense to determine alignment order with one optimality criterion (e.g., phenetics) and then analyse the alignment later with another (e.g., parsimony). It could also be argued that it would be interesting to examine the differences these might mean. That is, if a most-parsimonious tree and UMPGMA based on the same pehnetic alignment agree more than the most-parsimonious tree and UMPGMA based on a parsimonious alignment, one might conclude that the phenetic alignment was unduly affecting the parsimony procedure.


Equally optimal alignments

You'll recall that the neighbor joining procedure was a form of one-pass distance analysis but did not search for equally optimal (or more optimal) solutions than found on that first pass. Similarly, most alignment software including PILEUP, CLUSTAL and TreeAlign, will give you a single multiple alignment.
The problem is that there may be equally optimal alignments for the same data no matter what your optimality criterion is. Only MALIGN will give you multiple equally optimal alignments, and it can do so for a pairwise alignment procedure (phenetic) or for a searched and swapped parsimony based procedure.

Determining score

Most alignment algorithms determine the cost of an alignment column-wise. That is, for example, given the alignment :
LOON: AAC
 DOG: ACA
CROC: CCA
 RAT: CAC
There is one difference (two states) in each of the columns, thus the column-score for the alignment is 3.

However, it is possible to interpret the alignment in a transformational context (that is, in terms of what is possible given that they cannot all be each others closest relatives). There's no reason why this couldn't be done in a likelihood framework, but it has not yet been. In a parsimony framework, it woud be impossible to get only 3 steps on any given tree. Rather, the cost for all possible trees, or the cladogram-score, is 4.

Wheeler and Gladstein have extended this application in terms of cladogram costs. The premise is that the alignment that gives the lowest cost on the best tree that can be found for that alignment is the most optimaly parsimonious alignment.

To clarify this a little more, an alignment such as

I   ACCGTTGGA
II  AC-GTCTGA
III AC-GTC-AG
IV  AC-GTT-AG
might be arrived at by way of a wagner-algorithm step-wise adition of taxa in this order (I with II), then (III with IV), then these two sets connected, or in summary the alignment order is
((I II)(III IV)).
This then needs to be evaluated for its cost. There are three trees for the four taxa that the alignment can be evaluated on. Considering the gaps to be missing (which is not necessary) the various trees require 5, 6 or 7 steps. Thus the best score for this alignment is 5 (more complicated datasets would require cost evaulation by swapping on trees, of course, as opposed to evaluating all possible topologies as we have here).

We can then swap on the alignment, that is instead of the alignment-topology ((I II)(III IV)), we swap I with III and get the alignment order

((III II)(I IV)),
which suppose gives us this slightly different alignment:
I   ACCGTTGGA
II  AC-GTCTGA
III AC-GTCAG-
IV  AC-GTTAG-
again, assessing the cost of this alignment on the various tree topologies gives 4, 4, or 3 as the number of steps, thus the best score for this alignment is 3.
Because the second alignment is more optimal than the first, we discard the first, keep the second and swap on the alignment topology again.
We can keep doing this until we are satisfied that we can't get a better alignment.


Parameters

Most people like to point an shoot. Alas, good science is more difficult than this.

A summary of just a few of the various parameters that you need to think about with alignment includes:

  1. cost of a substitution

  2. cost of an indel

  3. should it cost the same for indels of different sizes?

  4. optimality criterion for alignment

  5. optimality criterion for assessing score

  6. if cladogram cost, do gaps count in the score?

  7. how thoroughly do you want to search different alignment topologies?

  8. if cladogram cost, how thoroughly do you want to search for optimal trees for an alignment?

  9. how many equally optimal alignments will you allow the algorithm to search for?

  10. if cladogram cost, how many trees to evaulate score on?

In order to help you get a handle on some of this, there is MalignOnlign and you should read the Malign Manual.