Distance


Distance methods are also known as "minimum evolution" methods. This is a misnomer from the point of view that it does not distinguish these methods from others like parsimony and likelihood which too are forms of minimizing evolution.

Consider the data set:

I   aagtcatgct
II  aaatcaggct
III cagacagtca
IV  cacactgcca
This yields the pairwise p(ij) distance matrix:
Uncorrected ("p") distance matrix
       I    II  III IV
I       -
II   0.20    -
III  0.50 0.50    -
IV   0.70 0.60 0.30  -
If we wish to calculate the best fitting "p" distance to say the three taxon tree
       I
       |
       a
       |
       .
      / \      
     b   c
    /     \
   II     III
Then
a = [p(I,II) + p(I,III) - p(II,III)]/2
b = [p(II,I) + p(II,III) - p(I,III)]/2
c = [p(III,I) + p(III,II) - p(I,II)]/2
or
a = [0.20 + 0.50 - 0.50]/2 = 0.10
b = [0.20 + 0.50 - 0.50]/2 = 0.10
c = [0.50 + 0.50 - 0.20]/2 = 0.40
or
       I
       |
      0.10
       |
       .
      / \      
   0.10  \
    /    0.40
   II      \
            \
            III
Notice that the measure must satisfy the triangle inequality (must be metric) but need not be ultrametric (i.e., where a = b = c).

Initially, methods like that of Fitch and Margoliash were designed to find the tree that provided the least distortion of the (t(t-1)/2) pairwise distances of t taxa.

The distortion, like the cophenetic correlation coefficient, is the difference between the observed matrix above and a matrix derived form the resulting tree.

For the three taxon analysis above there is no distortion whatsoever.

Things are never so clean with more than three taxa. Consider the following possible four-taxon trees for the same matrix:

They imply,

        First Tree         Second Tree        Third tree
      I    II  III IV    I    II  III IV    I    II  III IV 
I       -                  -                  -              
II   0.20    -          0.23    -          0.23    -         
III  0.52 0.48    -     0.30 0.50    -     0.50 0.30    -    
IV   0.67 0.63 0.30  -  0.70 0.31 0.37  -  0.31 0.60 0.37  - 

And remember the original matrix was:

Uncorrected ("p") distance matrix
       I    II  III IV
I       -
II   0.20    -
III  0.50 0.50    -
IV   0.70 0.60 0.30  -

The distortion could be measured as the sum of the absolute differences, or :

        First Tree         Second Tree        Third tree
          0.01                0.59              0.69
So clearly, then the first tree yields a pairwise matrix that is most-similar to the original set of pairwise distances. But absolute differences are not always what is used.

The cophenetic correlation coefficient which was used by phenetics was supplanted by the distortion coefficient E where

E = T-1
Σ
i
T
Σ
j=i+1
wij |dij-pij|a
and T = number of taxa, w is some weighting function, dij is the observed pairwise distance between taxa i and j, pij is the path-length derived distance between taxa i and j, and if a is 1, then absolute distance, but if a = 2, then weighted least-squares distance.
  1. if wij = 1, then all distances are expected to have the same error (Cavalli-Sforza & Edwards, 1967)
  2. if wij = 1/dij, then error is expected to be proportional to the observed distance (Fitch-Margoliash, 1967)
  3. if wij = (1/dij)2 then expected error is proportional to the square-root of the observed distance (Felsenstein, 1993).

A weighted least squares distortion is :

        First Tree         Second Tree        Third tree
          0.0026              0.1299            0.1979
And again the first tree yields a pairwise weighted matrix that is most-similar to the original set of pairwise distances.

Minimum Evolution


Rather than choosing the tree with the least distortion of the original matrix, one could, of course, just choose the tree that minimizes the sum of the branch lengths. Even if this tree has more distorion, it can be argued to be the most efficient tree. This is the method of "Minimum Evolution" or ME.

Rzetski and Nei (1992) took credit for a modification of the distance method in which the distortion was ignored and path-lengths were merely minimized. However, Kidd and Sgaramella-Zonta (1971) actually had the idea first. This method yields:

        First Tree         Second Tree        Third tree
          0.823               0.903             0.852

And again the first tree is preferred. However, you should note that the third tree is second-best this time whereas in the distortion methods the second tree was second-best.

Neighbor Joining

Neighbor joining is similar to ME, but instead of searching for the best ME tree it simply does a one-pass ME joining of closest taxa (like UPGMA) and does not even claim to get the best tree under the ME criterion. It's not clear why you would want to use it. Here's what the author of the method says about it...
"This method (Saitou and Nei 1987) is a simplified version of the minimum evolution (ME) method (Saitou and Imanishi 1989, Rzhetsky and Nei 1992)....
However, construction of a minimum evolution tree is time-consuming...
In the case of the NJ method, the S value is not computed for all or many topologies,... only one final tree is produced.
As mentioned above, the NJ tree is usually the same as the ME tree when the number of OTUs is small. However, if this number is large and the extent of sequence divergence is small, the topological difference between the NJ and ME trees can be substantial (Rzhetsky and Nei 1993)."

So... it's better when you have lots of taxa because it's faster and gives you one tree, but then it actually is admitted to do a poor job when there are lots of taxa.

Problems,

  1. NEGATIVE BRANCH LENGTHS - in all of these methods, since the distances are additive, the best (or some other) tree may require negative branch lengths. There is no biological meaning to negative evolution and there is no biologically meaningful solution to this problem. You have the choice of
    1. leaving them alone,
    2. setting them arbitrarily to zero, or
    3. taking the absolute value
  2. TREATS dij AND pij AS THOUGH THEY WERE INDEPENDENT QUANTITIES. They are not of course. The path from II to III in the first tree is not independent of the path from II to IV.
    Instead of minimizing the E function above, one simply minimizes the sum of the path-lengths ( v ).
  3. GAPS - no logical way to include gaps in assessment of sequences. If gaps are randomly distributed (doubtful) they can be pair-wise deleted. Usually, though they must be list-wise deleted with the loss of considerable information.
  4. REALISM - consider
    I   a
    II  c
    III g
    
    The best tree has a total of 1.5 changes on it. But we know there are two changes. Also, it is not entirely clear what half-of-a-change should mean when nucleotides change as units. That is, the ancestor (in the middle) was no doubt something other than 1/5 a nucleotide different from I, II, and III.

Distances

So far we have been considering only p distances. It was pointed out (by Farris) that distances could not be actual estimates of amount of change for the reasons (among others) detailed above. It was then argued (esp. by Felsenstein) that rather than being "actual" measures of distance on a tree, these values should be interpreted as expected amounts of change across all sites. This then is only meaningful if
a) there is some process of change that is common to all of the sites and
b) that one both knows it and can effectively model it.

Process models

Given that there are only 4 nucleotides to choose from, and assuming stochasticity, with a random assigment of nucleotides to two taxa you'd expect them to be 75% different. Here's why. The following is all possible combinations of assignment of nucleotides to two taxa.

A A
A C
A G
A T
C A
C C
C G
C T
G A
G C
G G
G T
T A
T C
T G
T T
Thus, if two taxa are so far apart that (under a stochastic process) they are effectively randomized with respect to each other, the'd still be expected to appear to be 25% similar (= distance of 0.75).

This model is described as:

Jukes Cantor = - 3/4 ln(1-(4p/3))

where p is the p distance (above). Thus the maximum value it can take is 0.75 (which is what you expect with random assignment of nucleotides to a pair of taxa).

The p distance can exceed 0.75 if base substitutions are unequal; that is if the choice of nucleotides is not equal among the four. If it does, then the Jukes Cantor distance becomes undefined. That is, ln (1 - (4(0.8)/3)) = ln (- 0.07) = ? .

The Felsenstein 81 model was designed to counter this annoying problem:

F81 = - (1 - (πa2 + πc2 + πg2 + πt2))ln(1-(p/(πa2 + πc2 + πg2 + πt2)))

where, for example, πt is the average proportional thymidine base composition

  1. between taxa i and j, or
  2. across all taxa,
but usually the latter due to lower variance.

The Jukes Cantor distance can also be undefined if the number of transitions and transversions are unequal. The Kimura two-parameter distance circumvents that little annoyance:

K2P = 1/2 ln(1/(1-2P-Q)) + 1/4 ln (1/(1-2Q))

where P = proportion of changes that are transitions and Q = proportion of changes that are transversions as determined from pairwise comparisons.

The K2P distance does not take into account different base compositions though. Other distances like the Felsenstein 84 or HKY85 distance modify the distance further to prevent it from being undefined.

Here is an example. Above the diagonal are the uncorrected p distances, below the diagonal are the K2P corrected distances. You will note that the p distance is an underestimate fo the K2P.

	         A      B     C     D     E 

  Species A	      0.20  0.50  0.45  0.40		 
  Species B	0.23        0.40  0.55  0.50	       
  Species C	0.87  0.59        0.15  0.40		 
  Species D	0.73  1.12  0.17        0.25
  Species E	0.59  0.89  0.61  0.31
For a more thorough discussion of various distance methods see the Mega Manual