Cladistics - weighting


Various arguments have been offered for the notion that not all characters should be considered of equal value in a phylogenetic analysis. The intention here is not to consider the merits of those arguments in any great detail, but rather to show how weighting is implemented and what its effects are.

To show very simply how weighting can affect an outcome, consider the data set

one   aaaa
two   acac
three cccc
four  caca
which defines two equally parsimonious trees:
 _____ one   aaaa
|_____ two   acac
|   __ three cccc
`--|__ four  caca

and
 _____ one   aaaa
|   __ two   acac
|--|__ three cccc
|_____ four  caca

If we now weight character #2 by 2 (a formal statement that it takes 2 changes by any other character or combination of other characters to refute the information about relationships offered by a change in character 2) we get only one most parsimonious tree:
 _____ one   aaaa
|   __ two   acac
|--|__ three cccc
|_____ four  caca
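
The effect of the weight can be checked by hand or with a short script. The following is a minimal sketch, assuming unordered characters scored with Fitch's counting rule; the function names are illustrative and do not come from any particular program. It scores all three possible unrooted four-taxon trees with equal weights and with character 2 weighted by 2.

```python
# Data set from the text: taxon -> character states
DATA = {"one": "aaaa", "two": "acac", "three": "cccc", "four": "caca"}

# The three possible unrooted trees for four taxa, each written as two pairs
TREES = [
    (("one", "two"), ("three", "four")),
    (("two", "three"), ("one", "four")),
    (("one", "three"), ("two", "four")),
]

def fitch_join(left, right):
    """Combine two Fitch state sets; return (new set, steps added)."""
    common = left & right
    if common:
        return common, 0
    return left | right, 1

def char_steps(tree, i):
    """Minimum number of changes for character i on a four-taxon tree."""
    (a, b), (c, d) = tree
    s1, n1 = fitch_join({DATA[a][i]}, {DATA[b][i]})
    s2, n2 = fitch_join({DATA[c][i]}, {DATA[d][i]})
    _, n3 = fitch_join(s1, s2)
    return n1 + n2 + n3

def tree_length(tree, weights):
    """Weighted length: sum over characters of weight times steps."""
    return sum(w * char_steps(tree, i) for i, w in enumerate(weights))

equal = [1, 1, 1, 1]
char2_doubled = [1, 2, 1, 1]
for tree in TREES:
    print(tree, tree_length(tree, equal), tree_length(tree, char2_doubled))
```

With equal weights the first two trees tie; with character 2 doubled, the tree grouping two with three is the shortest.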
Weighting methods can be roughly divided into two classes:
  1. a priori and
  2. a posteriori

A Priori Weighting

A posteriori Weighting

Usually the arguments in favor of this kind of weighting are based on the frequencies of different kinds of change.
It is also predicated on the notion that homoplasy is a "bad" thing and that homoplasy varies inversely with "reliability".

For example,

With these forms of weighting, sometimes it is argued that one should simply eliminate a class of changes entirely (e.g., Knight and Mindell, 1993; Mindell et al., 1995; Mindell et al., 1996; Mindell and Thacker, 1996; but see Cummings et al., 1995).

Knight and Mindell (1993) stated that "a level of TIs near 50% indicates that this class of change is saturated with multiple changes and is therefore not an indicator of phylogeny but is largely 'noise'" (see also Mindell and Honeycutt, 1990). In some cases (e.g., Mindell et al., 1995) all gapped regions in alignments, as well as third-position transitions, are eliminated. Saturation is judged from saturation plots (below), in which the numbers of transitions and transversions are plotted against total substitutions for all pairwise comparisons of taxa.

More often, though, it is argued that one should simply weight according to the inverse of the frequency.

If so, the frequency has to be determined across some tree. This can be a preliminary tree based on unweighted data (which leaves you open to the criticism of circularly biasing your results towards that unweighted tree, because the weights were derived in light of it), or a random set of trees.

Given the tree here for primate mtDNA (left), one can get a preliminary tree (by a search on the unweighted data) and then calculate the amount of change in 1st, 2nd and 3rd positions, which is 260:146:594 and would imply a weighting scheme for positions of about 2:4:1.

With the "trace changes and stasis" option in MacClade's Chart menu, one can determine the relative frequency of the various changes (right). On the whole, then, the number of Ti's is 470 and the number of Tv's is 214, which would suggest a weighting of Tv's:Ti's of about 2.
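
The arithmetic behind both ratios is just the inverse of each count, scaled so that the most frequent class gets a weight of 1. A minimal sketch follows; the function name is illustrative, and the counts are the ones quoted above.

```python
def inverse_frequency_weights(counts):
    """Weight each class by (largest count) / (its own count)."""
    most_common = max(counts.values())
    return {name: most_common / n for name, n in counts.items()}

# Counts quoted in the text for the primate mtDNA data
positions = {"1st": 260, "2nd": 146, "3rd": 594}
changes = {"Ti": 470, "Tv": 214}

for label, counts in (("codon positions", positions), ("Ti/Tv", changes)):
    weights = inverse_frequency_weights(counts)
    print(label, {k: round(v, 2) for k, v in weights.items()})
```

Rounding the results to whole numbers gives the 2:4:1 and 2:1 schemes mentioned above.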

Random trees for the same data suggest a number of Ti's between 576 and 715 and a number of Tv's between 938 and 1113. So, weighting transversions by somewhere between 1.5 and 2 appears to be indicated.

How is this implemented? In PAUP, one can set up a Sankoff matrix to define these weights as follows:

BEGIN ASSUMPTIONS;
USERTYPE tv = 4 a c g t
- 2 1 2
2 - 2 1
1 2 - 2
2 1 2 -; END;
That block gets placed at the end of the file. After executing the file, you need to type on the command line:
ctype tv: all; to apply it to all characters, or
ctype tv: 3-.\3; for example, to apply it to all 3rd positions only.
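
If several Tv:Ti costs are to be tried, the block can be generated rather than typed. The sketch below is only a convenience script and not part of PAUP; the function names are hypothetical, and the output simply copies the layout of the block shown above.

```python
PURINES = {"a", "g"}
BASES = ["a", "c", "g", "t"]

def is_transition(x, y):
    """A change is a transition when both bases are purines or both are pyrimidines."""
    return (x in PURINES) == (y in PURINES)

def step_matrix(ti_cost=1, tv_cost=2):
    """Return the four matrix rows as strings, with '-' on the diagonal."""
    rows = []
    for x in BASES:
        row = []
        for y in BASES:
            if x == y:
                row.append("-")
            elif is_transition(x, y):
                row.append(str(ti_cost))
            else:
                row.append(str(tv_cost))
        rows.append(" ".join(row))
    return rows

def assumptions_block(name="tv", ti_cost=1, tv_cost=2):
    """Assemble the text of the block in the same layout as in the text."""
    lines = ["BEGIN ASSUMPTIONS;", f"USERTYPE {name} = 4 {' '.join(BASES)}"]
    lines += step_matrix(ti_cost, tv_cost)
    lines[-1] += ";"
    lines.append("END;")
    return "\n".join(lines)

print(assumptions_block())
```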

Williams and Fitch (and Fitch and Ye) took this one step further by suggesting dynamic weighting. In this, each class of transformation in each direction is weighted inversely to its frequency. For the primate data, the frequencies of each transformation are:
            To
From     A      C      G      T
A        -    0.101  0.180  0.060
C      0.080    -    0.015  0.295
G      0.029  0.004    -    0.001
T      0.044  0.183  0.007    -

Because transformations are weighted inversely to their frequency, this would suggest the following Sankoff matrix:

BEGIN ASSUMPTIONS;
USERTYPE tv = 4 a c g t
- 900 820 940
920 - 985 705
970 996 - 999
956 817 993 - ; END;
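
The costs in this matrix appear to be 1000 x (1 - frequency); that reading is inferred from the numbers and is not stated in the text. A minimal sketch under that assumption reproduces the matrix to within one unit of rounding:

```python
BASES = "ACGT"

# Observed frequencies from the table (row = from, column = to);
# None marks the diagonal.
FREQ = [
    [None, 0.101, 0.180, 0.060],
    [0.080, None, 0.015, 0.295],
    [0.029, 0.004, None, 0.001],
    [0.044, 0.183, 0.007, None],
]

def dynamic_costs(freq, scale=1000):
    """Assumed rule: cost = scale * (1 - frequency), rounded to an integer."""
    return [
        [None if f is None else round(scale * (1 - f)) for f in row]
        for row in freq
    ]

for base, row in zip(BASES, dynamic_costs(FREQ)):
    print(base, " ".join("-" if c is None else str(c) for c in row))
```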
How does this affect the primate data set?
It doesn't. Weighting transversions twice as heavily, or infinitely, or using dynamic weighting all return the same tree as the unweighted analysis, notwithstanding the saturation plot for these data above.

Cautionary note

Frequencies are inherently rate-based arguments, and thus if one is going to justify weighting according to this principle, one should weight not according to the inverse of the frequency but according to the natural logarithm of the inverse of the frequency.
Thus, a Sankoff matrix that looks like this:
BEGIN ASSUMPTIONS;
USERTYPE tv = 4 a c g t
- 2 1 2
2 - 2 1
1 2 - 2
2 1 2 -; END;
really applies to a situation where transitions are more than 7 times as frequent as transversions.

Successive Approximations

Farris suggested that each character could be considered independently with respect to a weight implied by frequency of change. The character consistency index (minsteps/observedsteps) varies inversely with frequency of change on a tree, as does the character retention index ((maxsteps-observedsteps)/(maxsteps-minsteps)).
Farris suggested that the rescaled character consistency index, ci*ri, could be used as a weighting function.
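
The three indices are simple to compute once the minimum, maximum and observed number of steps are known for a character. A minimal sketch follows; the function names are illustrative, and returning 0 for the retention index of a character whose maximum equals its minimum is an assumption made here to avoid dividing by zero.

```python
def consistency_index(min_steps, observed_steps):
    """ci = minsteps / observedsteps"""
    return min_steps / observed_steps

def retention_index(min_steps, max_steps, observed_steps):
    """ri = (maxsteps - observedsteps) / (maxsteps - minsteps)"""
    if max_steps == min_steps:
        return 0.0  # assumption: treat an uninformative character as ri = 0
    return (max_steps - observed_steps) / (max_steps - min_steps)

def rescaled_consistency_index(min_steps, max_steps, observed_steps):
    """rc = ci * ri"""
    ci = consistency_index(min_steps, observed_steps)
    ri = retention_index(min_steps, max_steps, observed_steps)
    return ci * ri

# Hypothetical character: minimum 1 step, maximum 5 steps
for observed in (1, 2, 3, 5):
    print(observed, round(rescaled_consistency_index(1, 5, observed), 3))
```

The weight falls from 1 for a character with no homoplasy to 0 for a character that fits the tree as badly as it possibly could.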

The procedure involves

  1. finding a preliminary tree based on unweighted data
  2. reweighting by the rescaled consistency index
  3. searching again
  4. repeating steps 2 and 3 until the weights no longer change
In Hennig86 this is accomplished with successive bouts of
mh;bb;xs w;
which can be done by typing that in once, and then hitting the F3 key. In PAUP, the equivalent is
hsearch/swap = tbr; reweight /fit = rc;
which would have to be repeated as well.
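
The loop itself can be sketched on a toy problem. Everything below is an illustration only: the five-character data set is hypothetical, the "search" is an exhaustive comparison of the three four-taxon trees, and a single best tree is used for reweighting, whereas real programs must decide how to handle several equally good trees.

```python
from collections import Counter

# Hypothetical data: three characters group one+two, two group two+three
DATA = {"one": "aaaaa", "two": "aaacc", "three": "ccccc", "four": "cccaa"}
TREES = [
    (("one", "two"), ("three", "four")),
    (("two", "three"), ("one", "four")),
    (("one", "three"), ("two", "four")),
]

def fitch_join(left, right):
    common = left & right
    return (common, 0) if common else (left | right, 1)

def char_steps(tree, i):
    """Fitch step count for character i on a four-taxon tree."""
    (a, b), (c, d) = tree
    s1, n1 = fitch_join({DATA[a][i]}, {DATA[b][i]})
    s2, n2 = fitch_join({DATA[c][i]}, {DATA[d][i]})
    return n1 + n2 + fitch_join(s1, s2)[1]

def rc_weight(i, steps):
    """Rescaled consistency index (ci * ri) of character i."""
    states = Counter(row[i] for row in DATA.values())
    min_steps = len(states) - 1
    max_steps = len(DATA) - max(states.values())
    if max_steps == min_steps:
        return 0.0  # uninformative character
    ci = min_steps / steps
    ri = (max_steps - steps) / (max_steps - min_steps)
    return ci * ri

def best_tree(weights):
    """Stand-in for a tree search: pick the shortest of the three trees."""
    def length(tree):
        return sum(w * char_steps(tree, i) for i, w in enumerate(weights))
    return min(TREES, key=length)

def successive_weighting(max_rounds=10):
    n_chars = len(next(iter(DATA.values())))
    weights = [1.0] * n_chars            # step 1: unweighted data
    for _ in range(max_rounds):
        tree = best_tree(weights)        # steps 1 and 3: search
        new = [rc_weight(i, char_steps(tree, i)) for i in range(n_chars)]
        if new == weights:               # step 4: stop when weights are stable
            break
        weights = new                    # step 2: reweight
    return tree, weights

print(successive_weighting())
```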

Goloboff's Implied Weights

Suppose you have two equally parsimonious trees of length 154.
Now suppose that the only difference between those trees is disagreement between character #234 and character #1238.
                  234  1238
Steps on tree 1 :  2    14
Steps on tree 2 :  1    15

The proportional difference in number of steps for character 234 is much larger than for character 1238.

That is, for character 234 the proportional disagreement is:

(2 - 1)/(2*1) = 0.500
whereas for character 1238 it is:
(15-14)/(15*14) = 0.005
Or, character 234 is very different on the two trees (strongly supports tree #2) while character 1238 is not very different on the two trees (weakly supports tree #1).

The argument is that one should prefer tree #2.

Goloboff developed a way to weight characters without requiring a priori reference to a tree in which the implied weight for a character is:

W = K/(K + (observed steps - minsteps))
K is an arbitrary constant and there is no a priori way to choose a value for K, so it is recommended that one try 1, 2, 3, etc. and see what happens.
The optimality criterion, then, is to choose the "heaviest" tree (the tree that maximizes the sum of all W's across all characters).
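
Applied to the two-character example above, the calculation can be sketched as follows. Two things are assumptions here: the quantity added to K is taken to be the number of extra steps (observed steps minus the minimum possible), as in Goloboff's fit function, and each character is assumed to have a minimum of 1 step, which the example does not state. Only the two characters that differ between the trees are scored, since all other characters contribute equally to both.

```python
MIN_STEPS = 1  # assumption: both characters need at least one step

# Steps for the two conflicting characters on each tree (from the example)
TREE_STEPS = {
    "tree 1": {"234": 2, "1238": 14},
    "tree 2": {"234": 1, "1238": 15},
}

def fit(extra_steps, k):
    """Implied weight of one character: K / (K + extra steps)."""
    return k / (k + extra_steps)

def total_fit(steps_by_char, k):
    """Sum of the implied weights over the characters scored."""
    return sum(fit(s - MIN_STEPS, k) for s in steps_by_char.values())

for k in (1, 2, 3):
    scores = {name: round(total_fit(steps, k), 3)
              for name, steps in TREE_STEPS.items()}
    print("K =", k, scores)
```

For each of these values of K, tree 2 is the heavier tree, in agreement with the argument above.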