Consider the data set:

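| Taxon | Sequence |
|---|---|
| I | aagtcatgct |
| II | aaatcaggct |
| III | cagacagtca |
| IV | cacactgcca |

This yields the pairwise uncorrected ("p") distance matrix:

| | I | II | III | IV |
|---|---|---|---|---|
| I | - | | | |
| II | 0.20 | - | | |
| III | 0.50 | 0.50 | - | |
| IV | 0.70 | 0.60 | 0.30 | - |

If we wish to calculate the best-fitting "p" distances for, say, the three-taxon tree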

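          I
          |
          a
          |
          .
         / \
        b   c
       /     \
      II     III

Then

$$a + b = 0.20, \qquad a + c = 0.50, \qquad b + c = 0.50$$

or

$$a = \tfrac{1}{2}(0.20 + 0.50 - 0.50) = 0.10, \qquad b = \tfrac{1}{2}(0.20 + 0.50 - 0.50) = 0.10, \qquad c = \tfrac{1}{2}(0.50 + 0.50 - 0.20) = 0.40$$

or

          I
          |
         0.10
          |
          .
         / \
      0.10  \
       /    0.40
      II      \
               \
               III

Notice that the measure must satisfy the triangle inequality (must be metric) but need not be ultrametric (i.e., where a = b = c).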
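The p distances and the three branch lengths above can be reproduced with a few lines of Python. This is only a sketch: the names `p_distance` and `three_taxon_branches` are illustrative and do not come from any package, and the three-taxon solution simply inverts the three path-length equations for the tree shown.

```python
from itertools import combinations

# The four aligned sequences from the data set above.
seqs = {
    "I":   "aagtcatgct",
    "II":  "aaatcaggct",
    "III": "cagacagtca",
    "IV":  "cacactgcca",
}

def p_distance(s1, s2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s1, s2)) / len(s1)

# Pairwise matrix stored as {(taxon_i, taxon_j): distance}.
p = {(i, j): p_distance(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)}

def three_taxon_branches(d12, d13, d23):
    """Solve a + b = d12, a + c = d13, b + c = d23 for a, b, c."""
    a = (d12 + d13 - d23) / 2
    b = (d12 + d23 - d13) / 2
    c = (d13 + d23 - d12) / 2
    return a, b, c

a, b, c = three_taxon_branches(p[("I", "II")], p[("I", "III")], p[("II", "III")])

for pair, dist in p.items():
    print(pair, round(dist, 2))
print("a, b, c =", round(a, 2), round(b, 2), round(c, 2))
```

With three taxa there are three distances and three branches, so the fit is exact; that is why no distortion appears until a fourth taxon is added.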

Initially, methods like that of Fitch and Margoliash were designed to find the tree that provided the least distortion of the $t(t-1)/2$ pairwise distances of $t$ taxa.

The distortion, like the cophenetic correlation coefficient, is the difference between the observed matrix above and a matrix derived from the resulting tree.

For the three taxon analysis above there is no distortion whatsoever.

Things are never so clean with more than three taxa. Consider the following possible four-taxon trees for the same matrix:

They imply,

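First tree:

| | I | II | III | IV |
|---|---|---|---|---|
| I | - | | | |
| II | 0.20 | - | | |
| III | 0.52 | 0.48 | - | |
| IV | 0.67 | 0.63 | 0.30 | - |

Second tree:

| | I | II | III | IV |
|---|---|---|---|---|
| I | - | | | |
| II | 0.23 | - | | |
| III | 0.30 | 0.50 | - | |
| IV | 0.70 | 0.31 | 0.37 | - |

Third tree:

| | I | II | III | IV |
|---|---|---|---|---|
| I | - | | | |
| II | 0.23 | - | | |
| III | 0.50 | 0.30 | - | |
| IV | 0.31 | 0.60 | 0.37 | - |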

And remember the original matrix was:

Uncorrected ("p") distance matrix:

| | I | II | III | IV |
|---|---|---|---|---|
| I | - | | | |
| II | 0.20 | - | | |
| III | 0.50 | 0.50 | - | |
| IV | 0.70 | 0.60 | 0.30 | - |

The distortion could be measured as the sum of the absolute differences:

| First tree | Second tree | Third tree |
|---|---|---|
| 0.10 | 0.59 | 0.69 |

So the first tree clearly yields a pairwise matrix that is most similar to the original set of pairwise distances. But absolute differences are not always what is used.

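The cophenetic correlation coefficient, which was used in phenetics, was supplanted by the distortion coefficient E, where

$$E = \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} w_{ij}\, \left| d_{ij} - p_{ij} \right|^{a}$$

with $d_{ij}$ the observed distance between taxa i and j, $p_{ij}$ the corresponding path length on the tree, and $w_{ij}$ a weight: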

- if $w_{ij} = 1$, then all distances are expected to have the same error (Cavalli-Sforza & Edwards, 1967)
- if $w_{ij} = (1/d_{ij})^{2}$, then error is expected to be proportional to the observed distance (Fitch-Margoliash, 1967)
- if $w_{ij} = 1/d_{ij}$, then expected error is proportional to the square root of the observed distance (Felsenstein, 1993).

A least-squares distortion (here with $w_{ij} = 1$ and $a = 2$, which reproduces the values below) is:

| First tree | Second tree | Third tree |
|---|---|---|
| 0.0026 | 0.1299 | 0.1979 |

And again the first tree yields a pairwise matrix that is most similar to the original set of pairwise distances.
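The distortion values can be checked directly from the rounded matrices given above. The sketch below is a plain transcription of E; the function name `distortion` and its `weight` argument are illustrative choices, not part of any package. From these rounded matrices, the first tree's sum of absolute differences comes to 0.10, and the squared differences with $w_{ij} = 1$ reproduce 0.0026, 0.1299, and 0.1979.

```python
# Distances listed in the order (I,II), (I,III), (II,III), (I,IV), (II,IV), (III,IV).
observed = [0.20, 0.50, 0.50, 0.70, 0.60, 0.30]
tree_implied = {
    "first":  [0.20, 0.52, 0.48, 0.67, 0.63, 0.30],
    "second": [0.23, 0.30, 0.50, 0.70, 0.31, 0.37],
    "third":  [0.23, 0.50, 0.30, 0.31, 0.60, 0.37],
}

def distortion(d, p, a=2, weight=lambda dij: 1.0):
    """E = sum over pairs of w_ij * |d_ij - p_ij|**a."""
    return sum(weight(dij) * abs(dij - pij) ** a for dij, pij in zip(d, p))

# Sum of absolute differences (a = 1) and unweighted least squares (a = 2).
abs_diff = {name: distortion(observed, p, a=1) for name, p in tree_implied.items()}
least_sq = {name: distortion(observed, p, a=2) for name, p in tree_implied.items()}

# The same function with weights of 1/d_ij**2, shown only to illustrate the
# weight argument.
weighted = {
    name: distortion(observed, p, a=2, weight=lambda dij: 1.0 / dij ** 2)
    for name, p in tree_implied.items()
}

for name in tree_implied:
    print(name, round(abs_diff[name], 2), round(least_sq[name], 4))
```

Passing a different `weight` function switches between the weighting schemes listed above without changing anything else.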

Rather than choosing the tree with the least distortion of the original matrix, one could, of course, just choose the tree that minimizes the sum of the branch lengths. Even if this tree has more distortion, it can be argued to be the most efficient tree. This is the method of "Minimum Evolution" or ME.

Rzhetsky and Nei (1992) took credit for a modification of the distance method in which the distortion was ignored and path-lengths were merely minimized. However, Kidd and Sgaramella-Zonta (1971) actually had the idea first. This method yields:

| First tree | Second tree | Third tree |
|---|---|---|
| 0.823 | 0.903 | 0.852 |

And again the first tree is preferred. However, you should note that the third tree is second-best this time whereas in the distortion methods the second tree was second-best.

"This method (Saitou and Nei 1987) is a simplified version of the minimum evolution (ME) method (Saitou and Imanishi 1989, Rzhetsky and Nei 1992)....

However, construction of a minimum evolution tree is time-consuming...

In the case of the NJ method, the S value is not computed for all or many topologies,... only one final tree is produced.

As mentioned above, the NJ tree is usually the same as the ME tree when the number of OTUs is small. However, if this number is large and the extent of sequence divergence is small, the topological difference between the NJ and ME trees can be substantial (Rzhetsky and Nei 1993)."

So... it's better when you have lots of taxa because it's faster and gives you one tree, but then it actually is admitted to do a poor job when there are lots of taxa.
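To make concrete what "only one final tree is produced" means, here is a minimal sketch of the neighbor-joining pair-selection step applied to the four-taxon matrix above. It recovers the topology only; branch-length estimation is left out, and the function name and the frozenset-keyed distance dictionary are illustrative choices rather than anything taken from Saitou and Nei's software.

```python
from itertools import combinations

def neighbor_joining_topology(names, dist):
    """Return an unrooted topology as a tuple of three subtrees.

    dist maps frozenset({x, y}) to the distance between x and y.
    Branch lengths are not estimated in this sketch.
    """
    nodes = list(names)
    d = dict(dist)  # working copy; gains entries as clusters are formed

    def get(x, y):
        return d[frozenset((x, y))]

    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {x: sum(get(x, y) for y in nodes if y != x) for x in nodes}
        # Join the pair that minimizes Q(i, j) = (n - 2) d_ij - r_i - r_j.
        a, b = min(
            combinations(nodes, 2),
            key=lambda pr: (n - 2) * get(*pr) - r[pr[0]] - r[pr[1]],
        )
        cluster = (a, b)
        rest = [k for k in nodes if k != a and k != b]
        # Distance from the new cluster to every remaining node.
        for k in rest:
            d[frozenset((cluster, k))] = 0.5 * (get(a, k) + get(b, k) - get(a, b))
        nodes = rest + [cluster]
    return tuple(nodes)

observed = {
    frozenset(("I", "II")): 0.20,
    frozenset(("I", "III")): 0.50,
    frozenset(("II", "III")): 0.50,
    frozenset(("I", "IV")): 0.70,
    frozenset(("II", "IV")): 0.60,
    frozenset(("III", "IV")): 0.30,
}

tree = neighbor_joining_topology(["I", "II", "III", "IV"], observed)
print(tree)
```

For this matrix the pairs (I, II) and (III, IV) tie on the selection criterion, so either may be joined first; both give the same unrooted topology, the one that groups I with II and III with IV. No alternative topologies are ever scored, which is where the speed comes from.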

**Problems**

- NEGATIVE BRANCH LENGTHS - in all of these methods, since the distances are additive, the best (or some other) tree may require negative branch lengths. There is no biological meaning to negative evolution and there is no biologically meaningful solution to this problem. You have the choice of
  - leaving them alone,
  - setting them arbitrarily to zero, or
  - taking the absolute value.

- TREATS $d_{ij}$ AND $p_{ij}$ AS THOUGH THEY WERE INDEPENDENT QUANTITIES. They are not, of course. The path from II to III in the first tree is not independent of the path from II to IV.
- Instead of minimizing the E function above, one simply minimizes the sum of the path-lengths (*v*).
- GAPS - there is no logical way to include gaps in the assessment of sequences. If gaps are randomly distributed (doubtful) they can be pair-wise deleted. Usually, though, they must be list-wise deleted with the loss of considerable information.
- REALISM - consider a single site at which taxon I has a, taxon II has c, and taxon III has g. Every pairwise distance is 1, so each of the three branches of the best tree has length 0.5, and the best tree has a total of 1.5 changes on it. But we know there are two changes. Also, it is not entirely clear what half of a change should mean when nucleotides change as units. That is, the ancestor (in the middle) was no doubt something other than half a nucleotide different from I, II, and III.

**Process models**

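Given that there are only 4 nucleotides to choose from, and assuming stochasticity, with a random assignment of nucleotides to two taxa you'd expect them to be 75% different. Here's why. There are 4 × 4 = 16 possible combinations of assignment of nucleotides to two taxa; in 4 of them the two taxa match and in 12 they differ, and 12/16 = 0.75.

This model is described as the Jukes-Cantor distance, with p the uncorrected distance:

$$d_{JC} = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\,p\right)$$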

The p distance can exceed 0.75 if base substitutions are unequal; that is, if the choice of nucleotides is not equal among the four. If it does, then the Jukes-Cantor distance becomes undefined. That is, ln(1 - (4(0.8)/3)) = ln(-0.07), which has no real value.
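Both points can be checked numerically. The sketch below enumerates the 16 nucleotide pairs and then evaluates the standard Jukes-Cantor correction, $-\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$; returning `None` for the undefined case is a choice made for this example, not a convention from any package.

```python
import math
from itertools import product

# All 16 ways of assigning one nucleotide to each of two taxa.
pairs = list(product("acgt", repeat=2))
expected_p = sum(x != y for x, y in pairs) / len(pairs)  # 12 of 16 differ

def jukes_cantor(p):
    """Jukes-Cantor distance for an uncorrected distance p.

    Returns None when the logarithm's argument is not positive,
    i.e. when p >= 0.75 and the distance is undefined.
    """
    arg = 1.0 - 4.0 * p / 3.0
    if arg <= 0:
        return None
    return -0.75 * math.log(arg)

print(expected_p)
print(jukes_cantor(0.20))  # defined: a little larger than 0.20
print(jukes_cantor(0.80))  # undefined, so None
```

The correction is small for small p and grows without bound as p approaches 0.75.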

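The Felsenstein 81 model was designed to counter this annoying problem:

$$d_{F81} = -B \ln\left(1 - \frac{p}{B}\right), \qquad B = 1 - \left(\pi_{a}^{2} + \pi_{c}^{2} + \pi_{g}^{2} + \pi_{t}^{2}\right)$$

where, for example, $\pi_{t}$ is the average proportional thymine base composition

- between taxa i and j, or
- across all taxa.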
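As a hedged illustration, the form usually given for the F81 distance is $-B \ln(1 - p/B)$ with $B = 1 - \sum \pi_i^2$; the sketch below assumes that form. The function name and the skewed base frequencies are made up for the example. With equal base frequencies B is 0.75 and the result is the same as the Jukes-Cantor distance.

```python
import math

def f81_distance(p, freqs):
    """F81-style distance, assuming d = -B * ln(1 - p / B).

    freqs maps each nucleotide to its proportional base composition
    (averaged over the pair of taxa or over all taxa).
    Returns None when the logarithm's argument is not positive.
    """
    B = 1.0 - sum(f * f for f in freqs.values())
    arg = 1.0 - p / B
    if arg <= 0:
        return None
    return -B * math.log(arg)

equal = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
skewed = {"a": 0.40, "c": 0.10, "g": 0.10, "t": 0.40}  # hypothetical composition

d_equal = f81_distance(0.20, equal)    # matches the Jukes-Cantor value
d_skewed = f81_distance(0.20, skewed)  # larger, because B is smaller
print(d_equal, d_skewed)
```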

The Jukes-Cantor distance can also be undefined if the numbers of transitions and transversions are unequal. The Kimura two-parameter distance circumvents that little annoyance:

$$d_{K2P} = -\frac{1}{2} \ln\left(1 - 2P - Q\right) - \frac{1}{4} \ln\left(1 - 2Q\right)$$

where P = proportion of sites that differ by a transition and Q = proportion of sites that differ by a transversion, as determined from pairwise comparisons.
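A short sketch of the K2P calculation follows, assuming the standard formula $-\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$ and taking P and Q as proportions of sites. The helper names are illustrative, and the two sequences are taxa I and II from the data set at the top of this section.

```python
import math

PURINES = {"a", "g"}

def transition_transversion_props(s1, s2):
    """Return (P, Q): proportions of sites differing by a transition
    (purine<->purine or pyrimidine<->pyrimidine) and by a transversion."""
    ts = tv = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1
        else:
            tv += 1
    n = len(s1)
    return ts / n, tv / n

def k2p_distance(P, Q):
    """Kimura two-parameter distance; None when either log is undefined."""
    term1 = 1.0 - 2.0 * P - Q
    term2 = 1.0 - 2.0 * Q
    if term1 <= 0 or term2 <= 0:
        return None
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)

P, Q = transition_transversion_props("aagtcatgct", "aaatcaggct")
d = k2p_distance(P, Q)
print(P, Q, d)
```

For taxa I and II there is one transition (g/a) and one transversion (t/g) in ten sites, so P = Q = 0.1 and the corrected distance comes out slightly above the p distance of 0.20.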

The K2P distance does not take into account different base compositions though. Other distances like the Felsenstein 84 or HKY85 distance modify the distance further to prevent it from being undefined.

Here is an example. Above the diagonal are the uncorrected p distances, below the diagonal are the K2P corrected distances. You will note that the p distance is an underestimate of the K2P.

| | A | B | C | D | E |
|---|---|---|---|---|---|
| Species A | - | 0.20 | 0.50 | 0.45 | 0.40 |
| Species B | 0.23 | - | 0.40 | 0.55 | 0.50 |
| Species C | 0.87 | 0.59 | - | 0.15 | 0.40 |
| Species D | 0.73 | 1.12 | 0.17 | - | 0.25 |
| Species E | 0.59 | 0.89 | 0.61 | 0.31 | - |

For a more thorough discussion of various distance methods see the MEGA manual.