Consider the data set:
I    aagtcatgct
II   aaatcaggct
III  cagacagtca
IV   cacactgcca

This yields the pairwise p(ij) distance matrix:
Uncorrected ("p") distance matrix
I II III IV
I -
II 0.20 -
III 0.50 0.50 -
IV 0.70 0.60 0.30 -
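The uncorrected distance is simply the proportion of aligned sites at which two sequences differ. As a minimal sketch (the names `seqs`, `p_distance` and `p_matrix` are illustrative, not from any package), the matrix above can be recomputed from the four sequences:

```python
from itertools import combinations

# The four aligned sequences from the data set above.
seqs = {
    "I":   "aagtcatgct",
    "II":  "aaatcaggct",
    "III": "cagacagtca",
    "IV":  "cacactgcca",
}

def p_distance(s1, s2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(s1, s2) if x != y)
    return diffs / len(s1)

# All t(t-1)/2 pairwise distances, keyed by taxon pair.
p_matrix = {
    (a, b): p_distance(seqs[a], seqs[b])
    for a, b in combinations(seqs, 2)
}

for (a, b), d in p_matrix.items():
    print(f"{a:>3} vs {b:<3} {d:.2f}")
```

For example, I and II differ at 2 of 10 sites, giving 0.20.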
If we wish to calculate the best-fitting "p" distances for, say, the three-taxon tree

        I
        |
        a
        |
        .
       / \
      b   c
     /     \
    II     III

then

        I
        |
       0.10
        |
        .
       / \
    0.10  0.40
     /     \
    II     III

Notice that the measure must satisfy the triangle inequality (must be metric) but need not be ultrametric (i.e., where a = b = c).
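The three branch lengths follow from three equations in three unknowns: a + b = d(I,II), a + c = d(I,III) and b + c = d(II,III). A minimal sketch of the solution (the function name `three_taxon_branches` is illustrative):

```python
def three_taxon_branches(d12, d13, d23):
    """Solve a+b=d12, a+c=d13, b+c=d23 for the branch lengths a, b, c."""
    a = (d12 + d13 - d23) / 2
    b = (d12 + d23 - d13) / 2
    c = (d13 + d23 - d12) / 2
    return a, b, c

# Distances among taxa I, II and III from the matrix above.
a, b, c = three_taxon_branches(0.20, 0.50, 0.50)
print(f"a={a:.2f} b={b:.2f} c={c:.2f}")  # a=0.10 b=0.10 c=0.40
```

If the three distances violate the triangle inequality, one of the solved lengths comes out negative, which is why the measure has to be metric for this fit to make sense.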
Initially, methods like that of Fitch and Margoliash were designed to find the tree that provided the least distortion of the (t(t-1)/2) pairwise distances of t taxa.
The distortion, like the cophenetic correlation coefficient, is the difference between the observed matrix above and a matrix derived from the resulting tree.
For the three taxon analysis above there is no distortion whatsoever.
Things are never so clean with more than three taxa. Consider the following possible four-taxon trees for the same matrix:
They imply:

First tree
        I      II     III    IV
I       -
II      0.20   -
III     0.52   0.48   -
IV      0.67   0.63   0.30   -

Second tree
        I      II     III    IV
I       -
II      0.23   -
III     0.30   0.50   -
IV      0.70   0.31   0.37   -

Third tree
        I      II     III    IV
I       -
II      0.23   -
III     0.50   0.30   -
IV      0.31   0.60   0.37   -
And remember the original matrix was:
Uncorrected ("p") distance matrix
I II III IV
I -
II 0.20 -
III 0.50 0.50 -
IV 0.70 0.60 0.30 -
The distortion could be measured as the sum of the absolute differences, or:

First tree    Second tree    Third tree
0.10          0.59           0.69

So, clearly, the first tree yields the pairwise matrix that is most similar to the original set of pairwise distances. But absolute differences are not always what is used.
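The sum of absolute differences can be checked directly from the matrices. This is a minimal sketch with illustrative names (`observed`, `implied`, `abs_distortion`); each list holds the lower triangle in the order (I,II), (I,III), (II,III), (I,IV), (II,IV), (III,IV). With the matrices as printed, the six differences for the first tree are 0, 0.02, 0.02, 0.03, 0.03 and 0, which sum to 0.10.

```python
# Observed p distances, lower triangle read row by row.
observed = [0.20, 0.50, 0.50, 0.70, 0.60, 0.30]

# Distances implied by each candidate tree, in the same order.
implied = {
    "First tree":  [0.20, 0.52, 0.48, 0.67, 0.63, 0.30],
    "Second tree": [0.23, 0.30, 0.50, 0.70, 0.31, 0.37],
    "Third tree":  [0.23, 0.50, 0.30, 0.31, 0.60, 0.37],
}

def abs_distortion(obs, tree):
    """Sum of absolute differences between observed and tree-implied distances."""
    return sum(abs(o - t) for o, t in zip(obs, tree))

scores = {name: abs_distortion(observed, d) for name, d in implied.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```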
The cophenetic correlation coefficient, which was used in phenetics, was supplanted by the distortion coefficient E where
$$E = \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} w_{ij} \, \lvert d_{ij} - p_{ij} \rvert^{a}$$
A weighted least squares distortion is:

First tree    Second tree    Third tree
0.0026        0.1299         0.1979

And again the first tree yields a pairwise weighted matrix that is most similar to the original set of pairwise distances.
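A minimal sketch of the distortion coefficient E follows. It assumes d_ij is the observed distance and p_ij the distance implied by the tree (the absolute value makes the order immaterial). The function name `distortion` and its `weights` argument are illustrative. The three least squares values printed above are reproduced with a = 2 and every w_ij set to 1.

```python
# Observed distances in the order (I,II), (I,III), (II,III), (I,IV), (II,IV), (III,IV).
observed = [0.20, 0.50, 0.50, 0.70, 0.60, 0.30]

implied = {
    "First tree":  [0.20, 0.52, 0.48, 0.67, 0.63, 0.30],
    "Second tree": [0.23, 0.30, 0.50, 0.70, 0.31, 0.37],
    "Third tree":  [0.23, 0.50, 0.30, 0.31, 0.60, 0.37],
}

def distortion(obs, tree, a=2, weights=None):
    """E = sum over pairs of w_ij * |d_ij - p_ij| ** a."""
    if weights is None:
        weights = [1.0] * len(obs)  # assumption: equal weights by default
    return sum(w * abs(d - p) ** a for w, d, p in zip(weights, obs, tree))

least_squares = {name: distortion(observed, d, a=2) for name, d in implied.items()}
for name, e in least_squares.items():
    print(f"{name}: {e:.4f}")
```

Setting `a=1` gives the sum of absolute differences, and a different weighting, for instance `weights=[1 / d ** 2 for d in observed]`, down-weights the larger observed distances.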
Rzhetsky and Nei (1992) took credit for a modification of the distance method in which the distortion was ignored and path-lengths were merely minimized. However, Kidd and Sgaramella-Zonta (1971) actually had the idea first. This method yields:
First tree    Second tree    Third tree
0.823         0.903          0.852
"This method (Saitou and Nei 1987) is a simplified version of the minimum evolution (ME) method (Saitou and Imanishi 1989, Rzhetsky and Nei 1992)....
However, construction of a minimum evolution tree is time-consuming...
In the case of the NJ method, the S value is not computed for all or many topologies,... only one final tree is produced.
As mentioned above, the NJ tree is usually the same as the ME tree when the number of OTUs is small. However, if this number is large and the extent of sequence divergence is small, the topological difference between the NJ and ME trees can be substantial (Rzhetsky and Nei 1993)."
So... it's better when you have lots of taxa because it's faster and gives you one tree, but then it actually is admitted to do a poor job when there are lots of taxa.
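The quoted passage does not say how NJ decides which pair to join. In the commonly used formulation (an assumption here, not taken from the quote), each step joins the pair with the smallest Q(i,j) = (n-2) d(i,j) - r_i - r_j, where r_i is the sum of distances from taxon i to all others. A minimal sketch of that first step for the four-taxon matrix, with illustrative names:

```python
from itertools import combinations

taxa = ["I", "II", "III", "IV"]

# Observed p distances from the matrix above.
d = {
    ("I", "II"): 0.20, ("I", "III"): 0.50, ("II", "III"): 0.50,
    ("I", "IV"): 0.70, ("II", "IV"): 0.60, ("III", "IV"): 0.30,
}

def dist(x, y):
    """Symmetric lookup into the lower-triangle dictionary."""
    if x == y:
        return 0.0
    return d[(x, y)] if (x, y) in d else d[(y, x)]

def q_matrix(names):
    """Q(i,j) = (n-2)*d(i,j) - r_i - r_j for every pair of taxa."""
    n = len(names)
    r = {x: sum(dist(x, k) for k in names) for x in names}
    return {
        (x, y): (n - 2) * dist(x, y) - r[x] - r[y]
        for x, y in combinations(names, 2)
    }

q = q_matrix(taxa)
for pair, value in sorted(q.items(), key=lambda kv: kv[1]):
    print(pair, f"{value:.2f}")
```

Here (I,II) and (III,IV) tie for the lowest Q. Joining either pair leads to the same unrooted topology, with I grouped with II and III with IV, and no other topology is ever scored.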
Problems,
I    a
II   c
III  g
Process models
Given that there are only 4 nucleotides to choose from, and assuming stochasticity, with a random assignment of nucleotides to two taxa you'd expect them to be 75% different. Here's why. The following are all possible combinations of assignment of nucleotides to two taxa.
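The sixteen combinations can be enumerated directly. A minimal sketch, assuming each of the four nucleotides is equally likely in both taxa:

```python
from itertools import product

bases = "acgt"

# Every way of assigning one nucleotide to each of two taxa.
pairs = list(product(bases, repeat=2))

# Combinations in which the two taxa end up with different nucleotides.
differing = [(x, y) for x, y in pairs if x != y]

expected_p = len(differing) / len(pairs)
print(len(pairs), len(differing), expected_p)  # 16 12 0.75
```

Only the four matching combinations (aa, cc, gg, tt) show no difference, so 12 of 16, or 75%, differ.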
This model is described as:
The p distance can exceed 0.75 if base substitutions are unequal; that is, if the choice of nucleotides is not equal among the four. If it does, then the Jukes Cantor distance becomes undefined. For p = 0.8, for example, ln(1 - 4(0.8)/3) = ln(-0.07), and the logarithm of a negative number is undefined.
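Assuming the standard Jukes Cantor form d = -(3/4) ln(1 - 4p/3), whose logarithm argument matches the calculation above, a minimal sketch shows where the distance stops being defined. The function name `jukes_cantor` is illustrative, and returning `None` for the undefined case is a design choice of this sketch.

```python
import math

def jukes_cantor(p):
    """Jukes Cantor distance for an uncorrected distance p.

    Returns None when the logarithm's argument is not positive,
    which happens for p >= 0.75.
    """
    arg = 1 - 4 * p / 3
    if arg <= 0:
        return None
    return -0.75 * math.log(arg)

for p in (0.20, 0.50, 0.70, 0.80):
    print(p, jukes_cantor(p))
```

The corrected distance grows faster and faster as p approaches 0.75, and for p = 0.8 the function returns `None`.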
The Felsenstein 81 model was designed to counter this annoying problem:
where, for example, πt is the average proportional thymine base composition.
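Assuming the commonly cited form of this distance, d = -B ln(1 - p/B) with B = 1 - (πa² + πc² + πg² + πt²), a minimal sketch looks like this. The function name `f81_distance` and the example frequencies are illustrative, not data from the text.

```python
import math

def f81_distance(p, freqs):
    """Equal-rate, unequal-frequency distance: d = -B * ln(1 - p / B).

    freqs maps each base to its average proportion; B = 1 - sum(pi ** 2).
    Returns None when p >= B, where the logarithm is undefined.
    """
    b = 1 - sum(f * f for f in freqs.values())
    arg = 1 - p / b
    if arg <= 0:
        return None
    return -b * math.log(arg)

equal = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
skewed = {"a": 0.40, "c": 0.10, "g": 0.10, "t": 0.40}  # hypothetical composition

print(f81_distance(0.20, equal))   # same as Jukes Cantor, since B = 0.75
print(f81_distance(0.20, skewed))  # B = 0.66 here, so the correction is larger
```

With equal base frequencies B is 3/4 and the expression reduces to the Jukes Cantor distance. With unequal frequencies B is smaller, and in this form the distance is defined only while p < B.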
The Jukes Cantor distance can also be undefined if the numbers of transitions and transversions are unequal. The Kimura two-parameter distance circumvents that little annoyance:
where P = proportion of sites that differ by a transition and Q = proportion of sites that differ by a transversion, as determined from pairwise comparisons.
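Assuming the usual K2P form d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), a minimal sketch counts transitions (a/g or c/t) and transversions (everything else) between two aligned sequences and applies the correction. The function names are illustrative; the two sequences are taxa I and II from the first data set.

```python
import math

PURINES = set("ag")
PYRIMIDINES = set("ct")

def transition_transversion(s1, s2):
    """Return (P, Q): proportions of sites differing by a transition / transversion."""
    n = len(s1)
    ts = tv = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        # A transition stays within purines or within pyrimidines.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts / n, tv / n

def k2p(P, Q):
    """Kimura two-parameter distance, or None when a logarithm is undefined."""
    a1 = 1 - 2 * P - Q
    a2 = 1 - 2 * Q
    if a1 <= 0 or a2 <= 0:
        return None
    return -0.5 * math.log(a1) - 0.25 * math.log(a2)

P, Q = transition_transversion("aagtcatgct", "aaatcaggct")
print(P, Q, k2p(P, Q))
```

Taxa I and II differ by one transition (g/a) and one transversion (t/g), so P = Q = 0.1 and the corrected distance of about 0.234 is larger than the p distance of 0.20.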
The K2P distance does not take into account different base compositions though. Other distances like the Felsenstein 84 or HKY85 distance modify the distance further to prevent it from being undefined.
Here is an example. Above the diagonal are the uncorrected p distances; below the diagonal are the K2P corrected distances. You will note that the p distance is an underestimate of the K2P distance.
             A      B      C      D      E
Species A    -      0.20   0.50   0.45   0.40
Species B    0.23   -      0.40   0.55   0.50
Species C    0.87   0.59   -      0.15   0.40
Species D    0.73   1.12   0.17   -      0.25
Species E    0.59   0.89   0.61   0.31   -

For a more thorough discussion of various distance methods see the MEGA manual.