NSF Proposal


The Tree of Life: Phylogeny of Spiders

Project Summary

 

            Our aim is to produce a robust phylogeny of all the deepest branches within a mega-diverse group, the spiders, by combining a massive amount of newly generated comparative genomic data with a substantial set of new and re-assessed data on morphology and behavior. Spiders are among the oldest and most diverse groups of organisms on our planet, with fossils dating back to the Devonian (c. 380 million years ago) and a current diversity of over 37,500 described species placed in 3,471 genera and 109 families.  Among the few other mega-diverse groups that comprise similarly large branches of the tree of life on Earth, spiders stand out because of their ecological importance as the dominant non-vertebrate predators in most terrestrial ecosystems.  It is probably no exaggeration to say that without spiders, human populations would be greatly affected, as insect pests would devour even more than the one-third of our crops they already destroy.  Spiders in many ways "replicate" the evolutionary experiment insects represent.

            In contrast to other non-vertebrate groups of comparable size, the cornerstones for a comprehensive phylogenetic study of spiders are at hand.  Spiders uniquely enjoy a completely up-to-date, on-line, species-level taxonomic database extending from Linnaeus to the present -- essential to taxonomic and phylogenetic research.  Deep branches of spider phylogeny have been investigated in over 50 modern, quantitative cladistic analyses that overlap to cover a surprising proportion of total spider diversity (102 of 109 families, 23% of all genera, almost 2,400 homology hypotheses), although the complete matrix jointly implied by these studies has never been assembled, much less analyzed.  These studies provide an initial hypothesis of relationships far more detailed than that available for any similarly large and important non-amniote group; probably only fishes have received comparable cladistic scrutiny.

            However, these analyses have been based almost entirely on morphological (and a small amount of behavioral) data.  The insignificant amount of genomic work to date on spiders has been uncoordinated and of little utility for broad-scale phylogenetic investigation.  The advent of high-throughput DNA sequencing, however, makes it feasible to examine substantial parts of the genome across a dense sampling of spider taxa.  We propose to sequence at least 50 "loci" (genome samples of 500-1,000 or more base pairs that can be sequenced as single pieces in both directions simultaneously) for representatives of at least 500 genera of spiders and their closest relatives (the whipscorpion orders Amblypygi, Uropygi, and Schizomida).  These genera will be carefully selected by a sampling strategy designed to maximize the resolution of deep branches within spider phylogeny, and will purposefully include all the previously most-favored study organisms of ethologists, ecologists, physiologists, and developmental and molecular biologists, thus integrating and contextualizing their research.

            Data matrices will be produced that combine the new genomic data with a new, comprehensive survey of morphological and behavioral homologies, thus offering a unique "index" to all comparative data on one large group.  The more than 20 million entries in these matrices will dwarf those of all previous studies taken together.  The computational challenges posed by such huge matrices were insoluble until recently.  New computer software, designed in large part by members of our group and using massively parallel processing to achieve supercomputing capability, makes such analyses feasible for the first time.  We will use parsimony and maximum likelihood methods of phylogenetic reconstruction to analyze our data.  We will also quantitatively assess the robustness of the results and the contribution of various data partitions to phylogenetic patterns implied by these data.  Many of the leading researchers in phylogenetic systematics are arachnologists; this proposal involves an unusually integrated, collaborative, and informed team involving 5 PIs and 10 senior researchers, postdoctoral fellows, and graduate students, working in 14 labs housed in 13 institutions and 4 countries.
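
            As a rough check on the scale quoted above, the molecular part of the matrix can be estimated from the minimum sampling targets stated in this summary (at least 50 loci of 500-1,000 or more base pairs each, for at least 500 genera).  The sketch below is illustrative only: the names are ours, the inputs are the stated targets rather than final counts, and morphological and behavioral columns are not included.

```python
# Back-of-envelope size of the molecular matrix, using only the
# minimum targets stated in the summary (assumed, not final, counts).
LOCI = 50                    # "at least 50 loci"
TAXA = 500                   # "at least 500 genera"
BP_LOW, BP_HIGH = 500, 1000  # "500-1,000 or more base pairs" per locus


def matrix_cells(taxa, loci, bp_per_locus):
    """Number of nucleotide cells in a taxa x (loci * bp) matrix."""
    return taxa * loci * bp_per_locus


low = matrix_cells(TAXA, LOCI, BP_LOW)
high = matrix_cells(TAXA, LOCI, BP_HIGH)
print(f"{low:,} to {high:,} nucleotide cells")
```

            Under these assumptions the matrix holds between 12.5 and 25 million nucleotide cells, so the figure of more than 20 million entries corresponds to loci averaging above 800 base pairs.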

            We propose to collect a huge amount of genomic information in order to test and improve the results achieved by over 50 detailed morphological cladistic analyses conducted by more than 30 investigators during the past 15 years.  For three decades, the lack of well-tested phylogenies, rather than comparative data, has been the rate-limiting step in broad-scale evolutionary research.  We propose to remove that obstacle for one large group entirely.  These data and the resulting phylogeny will have ramifications that extend far beyond systematics.  Spiders are already model organisms in behavioral (especially sexual and web-building behaviors) and ecological (foraging, predator-prey systems, integrated pest management) research.  A robust and comprehensive phylogeny for the deepest branches of this large branch of the tree of life will greatly aid expanded research in all areas of comparative biology.

 

Project Description

Results from Prior NSF Support

            G. Hormiga and J. Coddington, Monographic Research in Araneoid Spider Systematics, DEB-9712353, $415,480, 1997-2002. Three Ph.D. students are working on the research projects funded by this grant.  Jeremy (Zujko-) Miller (working on Neotropical Erigoninae) is expected to defend his dissertation by September 02.  Ingi Agnarsson (revising Anelosimus; started Fall 98) has been advanced to candidacy and is expected to complete his dissertation by the end of 2003.  Matjaz Kuntner (revising Nephilinae; started his Ph.D. Spring 99) has completed his coursework and will take his orals in Fall 02.  We have made important progress in understanding the taxonomy and phylogenetics of our target groups of araneoid spiders.  Fieldwork has been carried out in Colombia (1998), Myanmar (1998), Costa Rica (1999), Guyana (1999), Chile (2000-01), South Africa (2001), Madagascar (2001) and Australia (2002); we are currently planning fieldwork in Thailand for Spring 2003.  All lab members have participated in most of these field trips.  The following products are available through our PEET project web site (www.gwu.edu/~clade/spiders/peet.htm):  Neotropical linyphiid spider taxonomic catalog; on-line catalog of the USNM spider collection; cladograms from past and upcoming papers on araneoid systematics (linked to their phylogenetic databases); on-line images of the linyphioid genera of the world (99% completed; copyright permissions pending before upload).  Publications supported by this grant include: Agnarsson (2000); Agnarsson (in press); Coddington & Colwell (2001); Griswold et al. (1999); Griswold, Long & Hormiga (1999); Herberstein et al. (2000); Hormiga (1998, 1999, 2000, in press); Hormiga & Coddington (2001); Hormiga, Scharff & Coddington (2000); Hormiga, Arnedo & Gillespie (in press); Kress et al. (1999); Kuntner & Hormiga (in press); Kuntner (in press); Kuntner & Sereg (2002); Miller (submitted); and Zujko-Miller (1999a, b).

            G. Hormiga, Scanning Electron Microscope for Systematic Biology, NSF DBI-0070362; G. Hormiga, PI; P. Herendeen, D. Lipscomb, J. Clark, D. Lieberman, Co-PIs; $118,274, 2000-2001. This grant provided funds to help establish an SEM facility at the Department of Biological Sciences (GWU). A LEO 1430VP variable SEM and accessory equipment (critical point drier, sputter coater) were purchased in 2000.  Publications resulting from the use of this equipment include: Hormiga (in press); Hormiga, Arnedo & Gillespie (in press).

            R. Gillespie and J. Coddington, Systematics of Spider Family Theridiidae, NSF DEB-9707744, 1997-2000, $200,900. This grant provided funds to estimate the phylogeny of the spider family Theridiidae from morphological and molecular data, based on a comprehensive sample of genera. The morphological work is nearly complete and the molecular work is done, although not yet published. Five gene fragments have been sequenced and 255 morphological characters coded for 51 genera (143 terminals), and papers on molecular, morphological, and combined analyses are in preparation. The grant has contributed to the support of two post-docs and two graduate students. Publications supported by this grant include: Agnarsson (2000), Agnarsson (in press), Coddington & Colwell (2001), Gillespie & Oxford (1998), Griswold et al. (1999), Herberstein et al. (2000), Hormiga & Coddington (2001), Hormiga, Scharff & Coddington (2000), Hormiga, Arnedo & Gillespie (in press), Scharff & Coddington (1997), Oxford & Gillespie (1998, 2001), Sorensen et al. (2002) and Tan et al. (1999).

            P. Sierwald, The Diplopoda: Research, Taxonomic Training and Computerization, NSF DEB 97-12438, $740,000, 1998-2002. Co-PI: W. A. Shear, Hampden-Sydney College, VA; 2 grad students, 1 post doc, 2 master's student interns, 6 undergraduate interns; FMNH millipede collection completely computerized, type collection separated. Web page: www.fmnh.org/research_collections/zoology/zoo_sites/millipeet/home.html. Publications supported by this grant include:  Bond, J.E. & P. Sierwald (In press a, b); Shelley R, P. Sierwald, S.B. Kiser & S. Golovatch (2000); Sierwald P. & S. I. Golovatch (2001); Shear, W. A. & D. A. Hubbard (1998a); Shear, W. A. & D. A. Hubbard (1998c); Shear, W. A. (1999a, 1999b, 1999c, 2000a, 2000b); Shear, W. A., M. Harvey & H. Hoch (2000); Shear, W. A. & P. Selden (2001). One on-line publication: 2001, Version 1.0, Editor: Petra Sierwald, Nomenclator Generum Diplopodorum. A complete genus listing of all genus-group names in the class Diplopoda from 1758 through 1999. Authors: Jeekel, C. A.W., R. L. Hoffman, R. M. Shelley, P. Sierwald, S. B. Kiser & S. I. Golovatch.

            W. Wheeler (with R. T. Schuh), The Evolution and Phylogeny of the True Bugs (Heteroptera), DEB 97-26587, $65,000, 1998-2001. During this three-year project, we attempted to acquire and sequence the broadest possible sample of heteropteran taxa.  Many of the specimens were obtained through fieldwork conducted by Schuh in Australia during the grant period.  Initially, new taxa were easy to acquire, and within a relatively short period we made tremendous progress toward having comparable sequences for most of the families and many of the subfamilies within the Heteroptera, as well as a dense sampling of outgroups.  The remaining 20% of the taxa were much harder to acquire.  Through continued fieldwork and contacts with colleagues, we have now sequenced 76 family-level taxa within the Heteroptera, this number being based on the revised classification of the Lygaeoidea recently published by Henry.  We have sequenced 17 outgroup families within the Hemiptera, including the Coleorrhyncha.  Sampling at the subfamily level was most dense in the Cimicomorpha and Pentatomomorpha.  The total sample includes about 445 taxa for which at least some sequence data were acquired.  The densest sampling is within the Miridae, where we have a relatively complete set of sequences for 170 taxa representing virtually all recognized suprageneric groupings. We chose to sequence the following gene regions, known to contain phylogenetic signal on the basis of prior studies: 18S rDNA (~1000 bases), CO1 (~1000 bases), 28S (~350 bases), 16S (~650 bases), or about 3000 bases per taxon for a total of more than 1.2 million bases.
Using these sequence data in concert with existing and newly acquired morphological data allowed testing of the following phylogenetic hypotheses: 1) suprafamilial relationships within the Hemiptera (with densest sampling for the Heteroptera); 2) family-group relationships within the Cimicomorpha; 3) family-group relationships within the Lygaeoidea; 4) family-group relationships within the Pentatomoidea; and 5) tribal-level relationships within the Miridae.  Preliminary analyses for the Cimicomorpha and Lygaeoidea indicate corroboration of the basic outlines of the hypotheses proposed by Schuh and Stys for the Cimicomorpha and by Henry for the Lygaeoidea. Publications supported by this grant: Wheeler, Whiting, Carpenter, and Wheeler (2001). 

 

Introduction

            Among the most fundamental missions of biology are a complete global inventory of the species on our planet, and a natural classification of those species on the basis of their phylogenetic relationships; the importance of both missions is well delineated in the reports and recommendations of Systematics Agenda 2000 (1994).  Phylogenetic classifications are scientific hypotheses that are crucial to all aspects of comparative biology; not only do they provide maximally efficient descriptions of the data on organismic attributes already at hand, they allow maximally effective predictions about the distributions of attributes not yet studied in detail.

            Imagine that we find a newly discovered species, and are able to identify it as a spider (for example, by discovering that it has abdominal silk glands and spinnerets, features unique to spiders).  From that information alone, we can predict, for example, that this new species will have male pedipalps that are modified for sperm transfer (another feature unique to spiders).  We can also predict that it will have the features characteristic of the larger groups to which spiders belong; as an arachnid, we can predict that the newly discovered species will have two body regions and four pairs of legs; as an arthropod, we can predict that it will have jointed appendages, etc.  Every grouping of species in a hierarchical classification enables such predictions, and the accuracy of the predictions depends on the degree to which the classification reflects the evolutionary history of the groups (i.e., the phylogenetic interrelationships of their component taxa).

            Groups of organisms are not all equally well known, of course, either in terms of inventorying all their component species, or of understanding the interrelationships among those species already described.  Estimates of species richness yet to be discovered range from about 8 million to 100 million species (Hammond, 1992), and only for the most conspicuous groups of large organisms (vertebrate animals, green plants) are we at all close to having a complete global inventory of species.  Unfortunately, vertebrate animals and green plants together represent only about 3% of the world's biota (and quite possibly the least representative 3% at that; Hammond, 1992; Platnick, 1999). This historical bias against smaller and less conspicuous organisms is also evident in the phylogenetic aspects of systematics, where it has severely hampered comparative biology.  Groups whose interrelationships are poorly understood are often actively avoided by the research community as model subjects for inquiry, leading to a vicious circle of continuing, comparative neglect.

            It is for all these reasons that the report of a recent NSF-sponsored workshop on "Assembling the Tree of Life: Research Needs in Phylogenetics and Phyloinformatics" calls for a major new initiative to resolve the basic outlines of the Tree of Life, with emphasis on the deeper branches of the tree (i.e., the oldest and most diverse groups).  We propose here to focus on spiders (Araneae), as a group that is an especially well-suited target for this initiative, by combining a massive, comparative sampling of spider genomes -- something never before undertaken, and only now achievable -- with an equally thorough synthesis of the existing and new morphological and behavioral data on the same set of taxa.

 

Why Spiders?

            Even among smaller and less conspicuous organisms, some groups have fared better than others.  Spiders are among the oldest and most diverse of such groups.  The earliest spider fossils are from 380 million year old Devonian deposits at Gilboa, New York (Shear et al., 1989), and the earliest fossils of the most closely related groups of arachnids are Devonian as well.  At present, there are over 37,500 currently valid species of spiders, grouped into 3,471 genera and 109 families.  By comparison, among the other animal groups ranked as Orders, only the five largest insect groups (Coleoptera, Lepidoptera, Hymenoptera, Diptera, and Heteroptera) and the mites (Acari) are larger.  Current estimates of the world's total spider diversity range from 76,000 (Platnick, 1999) to 170,000 (Coddington and Levi, 1991) -- in other words, somewhere between 20 and 50% of the world's total spider species have already been described and classified.  This contrasts well with other non-vertebrate taxa; the 8,000 known species of millipedes, for example, are thought to represent at most 10% of the actual total diversity, and the figure for mites would be much lower.
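
            The "between 20 and 50%" range follows directly from the figures given above.  The short sketch below simply reproduces that arithmetic; the variable and function names are ours, and the inputs are the described-species count and the two diversity estimates cited in this paragraph.

```python
# Fraction of estimated world spider diversity already described,
# using the counts and estimates quoted in the paragraph above.
DESCRIBED = 37_500  # currently valid described species

ESTIMATES = {
    "Platnick (1999)": 76_000,
    "Coddington and Levi (1991)": 170_000,
}


def described_fraction(described, estimated_total):
    """Share of the estimated total that is already described."""
    return described / estimated_total


fractions = {
    source: described_fraction(DESCRIBED, total)
    for source, total in ESTIMATES.items()
}

for source, frac in fractions.items():
    print(f"{source}: {frac:.0%} described")
```

            The lower estimate gives roughly 49% and the higher estimate roughly 22%, consistent with the range stated above.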

            Over recent decades, spider systematics has advanced dramatically, through the efforts of a relatively large number of specialists.  By way of comparison, both the Coleopterists Society (which covers all beetles) and the American Arachnological Society (which covers all arachnids other than mites) have approximately 600 members (not all of whom are systematists, of course), even though the number of beetle species in the world is an order of magnitude greater than the number of non-mite arachnids.  This disparity among research communities is also reflected in taxonomic activity; between 1978 and 1987, for example, an average of 2,300 new beetle species were described per year, whereas more than half as many (1,350) new arachnid species were described annually (Hammond, 1992), with spiders representing the lion's share of those new descriptions.  In addition, unlike most groups of non-vertebrates, our existing knowledge of spiders is well cataloged.  The taxonomically important contents of a series of 14 large volumes of printed catalogs (Roewer, 1942-55; Bonnet, 1945-59; Brignoli, 1983; Platnick, 1989, 1993, 1998) are now available electronically in "The World Spider Catalog" (Platnick, 2002), already on-line as >13 megabytes of text (at http://research.amnh.org/entomology/spiders/catalog81-87/index.html), and on CD-ROM in database format as well, with on-line database versions to follow.  The world catalog provides fast and easy access to information on original and all subsequent descriptions, synonymies, transfers, and geographical distribution.  Mutual links are being installed between entries in The World Spider Catalog and those for spiders in GenBank (Platnick has had oversight responsibility for the systematics of the spider listings in GenBank for several years).

            Moreover, spider diversity encompasses the taxonomic levels that are most crucial to research in comparative and evolutionary biology.  In spiders, most natural history attributes (e.g., foraging styles and ecological guilds, sexual dimorphism and sex-ratio characteristics, suites of behavioral characters, and major adaptive attributes) characterize genera or at most families.  For example, all members of the family Salticidae (jumping spiders) are diurnal sight-hunters.  Larger groups, such as orders, tend not to be so coherent with regard to the biological attributes of their members (i.e., a much wider variety of foraging modes, reproductive biology, and habitats exists among the other arachnid orders).  Species, on the other hand, tend to share most such biological attributes; for example, all members of the genus Deinopis (family Deinopidae) spin identical and unique webs.  Genera and families often demarcate evolutionary novelties, e.g., shifts in foraging mode or web-construction.  Therefore, research on these "mid-level" (familial and generic) phylogenies is an absolute necessity to place most of the comparative data from other biological disciplines (especially ecological, behavioral, physiological, and developmental studies) into a predictive framework. 

 

Why Now?

            Despite the immense size of the order, spiders have benefited from a relatively long history of modern phylogenetic research (Table 1, below, at end of Project Description).  Focusing just at the generic level and above, explicit morphological matrices analyzed by quantitative techniques cover 805 of the 3,471 described genera and 102 of the 109 currently recognized families.  Ignoring overlaps in characters, these studies involved 2,329 morphological characters (when overlaps are taken into account we estimate the number will reduce to perhaps 1,500 -- a rough indication of the number of morphological homology hypotheses to date for spiders).  In contrast, molecular data are available for fewer than 50 taxa, and with a few exceptions were gathered in order to exemplify Araneae in higher-level studies on chelicerates or arthropods, or for intrageneric studies.
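
            The coverage figures above can be restated as percentages.  The sketch below is a simple illustration using only the counts given in this paragraph; the names are ours, and the counts come from the text rather than from a recount of Table 1.

```python
# Taxonomic coverage of existing quantitative morphological analyses,
# expressed as percentages of described genera and recognized families.
COVERAGE = {
    # rank: (taxa included in at least one matrix, total described)
    "genera": (805, 3471),
    "families": (102, 109),
}


def percent_covered(included, total):
    """Percentage of taxa at a given rank that have been sampled."""
    return 100.0 * included / total


percentages = {
    rank: percent_covered(included, total)
    for rank, (included, total) in COVERAGE.items()
}

for rank, pct in percentages.items():
    print(f"{rank}: {pct:.1f}% covered")
```

            On these counts about 23% of genera and about 94% of families have been included in at least one quantitative analysis.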

            Taken together, these studies occurred over a span of 15 years and involved over 30 different investigators, methodological approaches, and systematic goals (Table 1).  Only about 200 genera are shared between two or more matrices. Character state definitions (even of the same homology hypothesis) vary significantly among studies, depending on the taxon sample used and the goal of the study.  If combined and edited for overlaps, these matrices can be the basis for a comprehensive database of comparative morphological information on spiders. Nevertheless, key deep nodes in spider phylogeny have not been addressed by these previous studies, for example the relationship between Palpimanoidea and the remaining entelegynes.  The internal structure of Palpimanoidea and Gnaphosoidea, as well as the placements of Periegopidae, Cryptothelidae, and Zodariidae, have never been tested quantitatively.  Indeed, except for Orbiculariae, no interfamilial relationships in spiders have been tested by substantial taxon sampling of the contained genera; results to date are based purely on exemplars and very sparse taxon sampling.  Dionychan monophyly requires testing.  Mygalomorph phylogeny is contentious: the classical families Dipluridae, Nemesiidae, Theraphosidae, and Cyrtaucheniidae seem to be para- or polyphyletic. The higher-level phylogeny of the suborder Mygalomorphae is currently being investigated by project participants Bond and Hedin, funded by an NSF grant (see under "Plan of Work:  Ingroup").  Within Araneomorphae, the higher-level phylogeny of Dionycha (17 families) is almost unknown (but is currently under study by project participant Ramirez), and the important tropical family Ctenidae may be polyphyletic.  Seven spider families have never been included in any quantitative cladistic study (although phylogenetic arguments exist for some): Periegopidae, Cryptothelidae, Cybaeidae, Halidae, Chummidae, Hahniidae, and Homalonychidae.
Periegopids are obviously haplogynes, the remainder entelegynes.  Cybaeids and hahniids together comprise 36 genera and show many critical character combinations that are sure to rearrange the provisional topologies suggested by the few multi-family cladistic studies published to date.  Many of the deeper nodes within Entelegynae, therefore, have only been superficially explored, and will certainly change to some extent. Many families are probably not monophyletic -- most obviously Ctenidae, but also Pisauridae, Miturgidae, Liocranidae, Corinnidae, Clubionidae, Amaurobiidae, Dictynidae, and Mysmenidae.

            This proposal seeks to produce a completely scored, internally consistent morphological and molecular matrix for at least 500 carefully chosen generic taxa that will sample all spider families; family and higher relationships will emerge as a result of detailed analysis at lower levels.  In short, morphological analyses of spider interrelationships have now advanced to the point where our current hypotheses need to be severely tested, and refined, by an entirely separate source of data.  Genomic information is the best available source of that test, and now needs to be collected on a scale comparable with that already achieved for morphology. 

 

Plan of Work:  Outgroups

            A phylogenetic study of any group must collect and analyze data on the closest relatives (outgroups) of the study group (ingroup), in order to root the resulting tree.  Two competing hypotheses on the sister group of spiders exist. One hypothesis maintains that whipspiders (order Amblypygi) constitute the sister group (Weygoldt and Paulus, 1979); the competing hypothesis maintains that Pedipalpi (Amblypygi, Uropygi and Schizomida together) is the sister group (Shultz, 1990; Wheeler and Hayashi, 1998).  If the latter hypothesis is true, including all three orders might still provide only one outgroup node. We therefore propose to obtain genomic information on representatives of all three orders as well as Palpigradi, to assure at least two outgroup nodes and thereby unambiguously polarize homology hypotheses within spiders.  Work on both the morphological and molecular characters of these outgroups will be under the primary direction of Lorenzo Prendini.

            Amblypygi.  The order Amblypygi includes ca. 141 species assigned to 19 genera and 5 families, two of which have two subfamilies each.  The Amblypygi are the best studied of the outgroup taxa; a cladogram based on morphological data is available for the families and genera, most of which are monophyletic (Weygoldt, 1996, 1999, 2000).  However, Charinus is seemingly paraphyletic and charinids are in serious need of revision (Delle Cave, 1986; Weygoldt, 2000; Harvey, in prep.).  The phrynichid subfamily Damoninae may also be paraphyletic or, if not, Trichodamon should be transferred to the Phrynichinae (P. Weygoldt, pers. comm.).  Our sampling strategy will minimally include representatives of the Damoninae and Phrynichinae (Phrynichidae), Heterophryninae and Phryninae (Phrynidae), Charinidae, and Charontidae.  Ideally, sampling would include representatives of as many of the 19 genera as possible.  The larger genera, especially Charinus, would be represented by two or more species, including (where possible) the type species.  Paracharon, from Guinea-Bissau, presently placed in a monotypic family and suborder, is considered to be the sister group of all other amblypygids (Weygoldt, 1996, 2000) and would therefore be an important (if perhaps elusive) target. Amblypygid genera and species are thinly spread across tropical and subtropical countries, often with only a single species recorded per country.  DNA samples are already in hand for eleven genera in four families and three subfamilies.  Neotropical collecting could yield four additional genera.  Phrynichosarax can be collected in India, Malaysia or Singapore, also important locales for schizomids and uropygids (see below).  The remaining genera are geographically restricted and would require collecting in Myanmar (Catageus) and South Africa (Phrynichodamon). 

            Schizomida.  The order Schizomida includes ca. 217 species assigned to 34 genera and two families (one with two subfamilies).  The higher classification of schizomids is explicitly phylogenetic, although not yet supported by quantitative analyses (Cokendolpher and Reddell, 1992).  However, the monophyly of most schizomid genera remains untested; Schizomus is particularly problematic because many older descriptions do not mention the characters now considered diagnostic (Reddell and Cokendolpher, 1995; J. C. Cokendolpher, pers. comm.).  The Schizomida are the least studied of the three outgroup orders; Harvey (unpublished) estimates that over 500 species may eventually be recognized globally.  The African and Asian schizomid faunas are the most poorly known.  Our sampling strategy will minimally include representatives of the two genera of the Protoschizomidae, the one genus of the hubbardiid subfamily Megaschizominae, and two genera of the Hubbardiinae.  Ideally, sampling would include as many of the genera as possible, with the larger genera represented by two or more species, including the type species, where necessary, but our minimal strategy addresses the most important areas of schizomid phylogeny.  Megaschizomus (from Mozambique and South Africa) is considered to be the sister group of the Hubbardiinae (Cokendolpher and Reddell, 1992) and is therefore an important target for resolving hubbardiine relationships.  The endemic Mexican Protoschizomidae, which comprise the sister group of the Hubbardiidae (Cokendolpher and Reddell, 1992), are also of considerable interest because the female genitalia resemble those of diplurid spiders and charinid amblypygids (J. C. Cokendolpher, pers. comm.). Schizomid genera and species are also tropical and subtropical, and the optimal collecting strategy thus overlaps completely that for spiders, amblypygids, and uropygids.  Collecting in Australia and Mexico could provide species from 14 genera and both families.
Exemplars from 11 additional genera could be added by collecting in Brazil, Costa Rica, Cuba, Indonesia, and Singapore.  The remaining genera are each geographically restricted and would generally not be cost-effective to secure.

            Uropygi.  The order Uropygi, generally considered the sister group of the Schizomida (Weygoldt and Paulus, 1979; Shultz, 1990; Wheeler and Hayashi, 1998), is the least speciose of the three orders, comprising 16 genera and 102 species assigned to a single family with four subfamilies.  Uropygids are poorly known and lack a phylogenetically sound classification.  Harvey's unpublished preliminary analysis confirms the finding by Haupt and Song (1996) that the Hypoctonidae are not monophyletic.  Dunlop and Horrocks (1996) have even provided a conflicting hypothesis in which uropygid monophyly was violated by grouping the "hypoctonids" with schizomids rather than the remaining uropygids.  Several large uropygid genera (e.g., Thelyphonus) lack supporting apomorphies (Harvey, unpublished) and the status of some of the smaller genera (e.g., those erected by Speijer, 1933, 1936) is also dubious (Rowland and Cooke, 1973). Our sampling strategy will minimally include representatives of each of the four subfamilies; ideally, as many of the 16 genera as possible would be included, with problematic groups like Thelyphonus represented by two or more species, including the type species. The two major areas of uropygid endemism are in Asia and the Indo-Pacific (12 genera found from India to Fiji) and in the Americas (three genera found from the southern U.S. to Brazil).  DNA samples are already in hand for four genera from two subfamilies.  Collecting efforts in Indonesia, the Philippines, and Brazil could provide exemplars from an additional nine genera and one subfamily.  Exemplars from the remaining three genera and one subfamily could be added by collecting in India (Labochirus, Uroproctus) and West Africa (Etiennius).  Inclusion of the African and Indian species is of considerable interest from a biogeographic perspective and may be important in resolving relationships among the American and remaining Asian genera.

 

Plan of Work:  Ingroup

            Responsibilities for the various aspects of the ingroup analyses will be divided among the investigators (see Management Plan).  Co-PIs Coddington, Hormiga, Prendini and Sierwald and senior collaborators Arnedo, Bond, Griswold, Maddison, Ramirez, Scharff and Shear will compile the morphological and behavioral parts of the matrices.  Because so much basic work on spider anatomy and behavior has already been organized cladistically, we will make every effort to include and test all of it against the genomic data.  For 500 taxa, just synthesizing and concatenating the roughly 2400 homology hypotheses to date will require roughly 800,000 novel entries, as only about 15% of the total implied matrix is currently in hand.  Four sources will augment the estimated 1,500 unique homology hypotheses in the literature.  First, many highly informative sources of morphological data have been examined in some but not all spider families; among these are spinneret spigot, setal, and tarsal organ morphology, studied through scanning electron microscopy.  Previous surveys of these and other characters will be expanded to cover the full taxon set.  Second, the taxonomic literature for many taxa suggests diagnostic characters and morphological oddities not yet assessed cladistically, e.g. onychia, details of male genitalia, male epiandrous spigots, female reproductive systems, spination patterns, cheliceral modifications, and male sperm duct trajectories.  Third, comparative biologists other than systematists have proposed many homologies over the years never assessed by rigorous systematic research (e.g. details of eye morphology, sperm ultrastructure, musculature, various gland systems, mating postures, attack behaviors, eggsac features, dragline/line climbing behavior, and especially ultrastructural features such as stridulatory structures, pore fields, hair types, and cuticular textures). 
Fourth, many groups at multiple hierarchical levels have never been studied phylogenetically, and are sure to yield myriad new discoveries.  We will marshal all of these morphological, behavioral, microanatomical, and ultrastructural data and unite them with newly collected molecular data to create one unified, consistent, modern encyclopedia of comparative, heritable information on spiders and their closest relatives. All collaborators will be involved in the data analysis, which will be spearheaded by Wheeler, Goloboff, and Maddison who have developed most of the software involved (see Data Analysis section, below). The choice of taxonomic exemplars is obviously crucial, but intermediate results will be required before we can identify issues such as long branches that need to be broken up by taxon addition, or important character optimizations that are made ambiguous by the omission of critical taxa.  We seek a robust phylogeny that will strongly impact comparative biology, be used widely, and be broadly applicable.   The following mix of theoretically and practically driven criteria seem important to those goals: 

 

1) The cladistically most crucial representatives of a group's groundplan are the two most basal lineages; when the composition of this "first doublet" is suggested by an existing analysis, we will sample those lineages first.  Thus for example, within the Theridiosomatidae, we would choose Plato (Platoninae) and Epeirotypus (Epeirotypinae) over Wendilgarda and Epilineutes (Theridiosomatinae). Sampling Filistatinella among filistatids, together with either Filistata or Kukulcania but not both, is another example of basal lineage selection. 

2) Except for monotypic families, at least two non-congeneric species will be sampled from every family, if possible, to ensure that putative family-level synapomorphies are cladistically informative and tested against the full dataset.  For small or dubiously monophyletic families (Liocranidae, Miturgidae, Nemesiidae, etc.), the type genus will be sampled in addition to component lineages. 

3) We will use existing cladistic information to select the most basally branching genera from all significant clades.  Where no such cladograms or modern classifications exist, we will consult the most detailed classification available (Roewer's 1942-55 catalog arrangement, which included 183 subfamilies and 351 tribal groupings).  Although the family-level classification proposed by Roewer has been thoroughly refuted, many of his lower-level groupings (often taken from the previous work of Simon) may be monophyletic, and should provide a better-than-random map of the internal cladistic structure of families that have not yet been studied phylogenetically. 

4) To ensure that we have sufficiently dense sampling of those genera most crucial to establishing the deeper branches of spider phylogeny, we will bias sampling against the seven largest families (Salticidae, Linyphiidae, Araneidae, Theridiidae, Thomisidae, Lycosidae, and Gnaphosidae).  Each of these large groups is currently considered monophyletic.  We will sample them sufficiently to test current hypotheses of their internal structure (and monophyly), but five of them already have (or will have, from on-going work in our laboratories) supported phylogenies that are more detailed than the average resolution we hope to achieve for deeper clades.  For the purposes of this project, we are less interested in the details of the distal branches of the internal phylogenies of those families than in what those families have to tell us about interfamilial relationships among araneoids, lycosoids, and dionychans in general.  These seven families jointly account for almost half of the described spider genera (1,693 genera); by undersampling their terminal branches, we can achieve very dense coverage of all remaining lineages, and hence the deepest, most contentious, and most difficult questions of spider phylogeny.  We will consult widely with colleagues working on the seven large families, to achieve a choice of exemplars that will maximize synergy with their efforts and on-going studies (as, of course, we will do for the smaller families as well). 

5) To maximize the impact of our results on related fields, we will choose taxa that have been (or are likely to be) the subjects of detailed study by behaviorists, ecologists, physiologists, and other non-systematic biologists, so that their past and future results map easily to the phylogenetic and evolutionary context we will provide.  These taxa tend to be easy to find, and abundant, so choosing them is also pragmatic. 

6) As the success of high-throughput sequencing depends on the availability of high-quality DNA samples, we will attempt (through our own fieldwork and that of our colleagues) to secure fresh, adult material of all taxa if such is not already available.  Newly collected specimens, fixed in absolute ethanol, amplify much more successfully than do standard museum specimens that have been stored in 70-80% ethanol for extended periods.  Newly collected specimens also have the advantage that successful DNA amplification is usually possible using only one or two legs, so that the remaining parts of the specimen are fully useful as vouchers and for normal systematic investigation (in all cases, the genitalic structures necessary for specific-level identifications will be vouchered).  Using legs only also has the advantage of greatly reducing the possibilities of contamination by sequencing prey DNA from the digestive system of whole animals.

            The drawback, of course, is that fieldwork is required at multiple sites around the world.  Our budgeted fieldwork will "piggy-back" on existing projects wherever possible.  Charles Griswold is funded to conduct field surveys of spiders in Madagascar and China; he and his field crews will collect material in absolute alcohol for sequencing.  Bond and Hedin are currently funded to travel to South Africa, western Australia, and South America to collect mygalomorphs, and will be preserving spiders from other families for DNA and morphological work. Our continuing PEET projects allow sampling in the Neotropics, southeast Asia, and Australia.  The Smithsonian budget offers competitive funding opportunities that have supplemented substantially prior NSF-funded projects in which Coddington is a co-PI.  In other cases, it will be cheaper to provide funding to colleagues already working at target sites than to visit them ourselves, and we will aggressively pursue all such opportunities to secure needed specimens at the lowest possible cost.

            Given the two ABI 3700 sequencers, the BIOMEK sequencing robot, and a single technician line, the cost of sequencing supplies, not personnel or equipment, is the rate-limiting factor.  Were supply costs not a factor, we would aim for sequencing 100 loci for each of 1000 taxa, and we will attempt to find other sources of funding to allow additional taxa to be included.

            Bond and Hedin are currently funded by NSF to conduct work on the systematics of the Mygalomorphae. This work will combine morphological and molecular data for a comprehensive sample of mygalomorph genera from all 15 families - the target sample includes about 120 total taxa (about 110 different genera). Clearly, this effort overlaps with the proposed goals of this grant, but we see this overlap as generally synergistic in two obvious ways. First, the Bond and Hedin phylogenetic sample will be a perfect forum to explore gene utility for all spiders - mygalomorphs are an obvious clade with several well-defined subclades. Furthermore, the group includes both deep- and shallow-diverging lineages. Because Bond and Hedin will have DNA samples available for key taxa/clades before most of the TOL work begins, exploratory analyses of gene utility might best be conducted in this group. Second, the genomics results of the TOL work will feed back into the efforts of Bond and Hedin. New genes, found to be informative for the broader spider sample, might be applied to the large taxon sample of mygalomorphs. This feedback will greatly strengthen the molecular systematics component of the mygalomorph research. 

            The list of the 109 spider families showing the current number of described genera in each family, followed by the minimum number of genera we hope to sample, is presented in the Management Plan of the proposal, under "Morphological data" (see Supplementary Documentation).  

 

Sequencing Techniques

            Primer Search:  We will take three approaches to generate a set of at least 50 loci that will PCR-amplify and sequence from the spider and outgroup taxa.  These three approaches are: PCR-primer design and genomic DNA probing, EST-cDNA library generation and overlap, and a combination of the first two.   Through literature sources (e.g. Colgan et al., 1998; Damen, Weller & Tautz, 2000; Masta, 2000; Regier & Shultz, 1997; Tatarenkov et al., 1999; Wheeler, 1989; Wheeler, Cartwright & Hayashi, 1993; Whiting et al., 1997) we have designed primers that amplify and sequence 22 loci (18S-[1,800bp], 28S-[2,315bp], 16S-[550bp], CO1-[1,100bp], 12S-[300bp], H3a-[350bp], beta actin-[3,600bp], ITS1-[500bp], ITS2-[500bp], RNAHel-[500bp], Ntid-[~900bp], Amy-[500bp], Kuz-[800bp], C1-J-2309/C2-N-3389-[1,000bp-amplifies 717bp of COI and 300bp of COII], U2-[150bp], POLII-[600bp], DDC-[700bp], cmos-[441bp], Boss-[400bp], Hb16S/HbND1-[1,569bp-amplifies 50bp of 12S, 51bp of tRNA Val, 1020bp of 16S, 53bp of tRNA Leu[CUN], and 395bp of ND1], EF-1a-[500bp], Runt gene-[400bp], Hunchback gene-[450bp]).  Six of these loci (18S rDNA, 28S rDNA, 12S mtrDNA, 16S mtrDNA, Cytochrome Oxidase I, and Histone 3a) have been examined in detail and results presented for over 100 taxa in the preliminary data section of this proposal.  Given our success with this approach, we feel confident that these methods will continue to yield loci amenable to genomic PCR amplification.   A second approach we will take is based on EST analysis and cDNA generation.  The general methodology for generating new primers is to construct cDNA libraries for a group of taxa representing the diversity of the targeted group (Carninci et al., 2000).  This targeted group may be spiders as a whole, or sub-clades may be more intensively sampled.  From these libraries we will generate ESTs (random sequences of the cDNA). Sequences common to multiple libraries are then used to design primers for that specific locus. 
The major problem facing this method is the frequency of overlap. Since sequences are generated at random, they have a very low probability of overlapping across libraries. Fortunately, there are several ways to improve overlap frequency (Piao et al., 2001; Ko, 1990), such as basing libraries on specific tissue, thereby reducing the diversity of expressed genes.  Our third approach is to combine the first two by "fishing" the libraries with primers derived from whole genome computational analysis and literature-based primer design.  Tools exist within GenBank (ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/ and http://sea-urchin.caltech.edu:8000/genome/databases/) for such procedures and should enrich our yield of homologous loci.  When candidate loci are identified, and suitable primers developed, loci will be sequenced for a small subset of taxa (ten or so) including broadly distributed representatives.  From these initial data, we will assess the level of variability (or conservation) and decisions will be made as to the suitability of continuing to collect data from that locus. Furthermore, confounding issues such as paralogy can be explored through this initial foray.  Multiple PCR bands, wildly discordant sequences, or huge size variation would all lead to suspicions of homology problems with that locus.  Issues such as intron variation may be very useful information systematically, but will make sequencing efforts and primer design much more complex.  If the introns were small, and easily characterized, we would attempt to make use of this information.  If intron variation is large, however, in size or complexity, we would be unlikely to continue to invest time and energy in that system.
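
As a rough illustration of the pilot assessment of variability described above, the following minimal sketch computes the fraction of variable columns in an aligned pilot data set. The function name, the toy sequences, and the decision to ignore gaps and ambiguity codes are assumptions made for illustration; the proposal does not specify how variability would be scored.

```python
def variable_site_fraction(aligned_seqs):
    """Fraction of alignment columns that show more than one nucleotide.

    Gaps ('-') and ambiguity symbols ('N', '?') are ignored when comparing
    states; this treatment is an assumption of the sketch.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if length == 0:
        raise ValueError("empty alignment")
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")

    variable = 0
    for column in zip(*aligned_seqs):
        states = {base.upper() for base in column} - {"-", "N", "?"}
        if len(states) > 1:
            variable += 1
    return variable / length


# Hypothetical pilot alignment for one candidate locus (four taxa, 8 sites).
pilot = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTAC-A"]
print(variable_site_fraction(pilot))  # 0.25 (columns 3 and 8 vary)
```

A locus scoring near zero would be too conserved to be informative, while one scoring very high across a ten-taxon pilot might be saturated at this depth; where to set either threshold is left open here.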

            Isolation of DNA:  Genomic DNA samples are obtained from fresh, frozen, or ethanol-preserved tissues in a solution of guanidinium thiocyanate homogenization buffer following a modified protocol for RNA extraction (Chirgwin et al., 1979).  Alternative automated DNA preparation is accomplished using the Qiagen DNeasy Tissue Kit: DNeasy Protocol for Animal Tissues.

            PCR amplification and Sequencing:  Our molecular lab currently uses one ABI 3700 automated sequencing machine and has added a second (NASA-funded) to accomplish its comparative sequencing projects.  A Biomek sequencing robot was recently added to the facility to automate PCR purification and sequencing procedures.  The combination of these pieces of equipment has increased our ability to sequence DNA by an order of magnitude.  The robotic sequencing machines interact directly with two Tetrad 4-head Thermocyclers. In general, amplification is carried out in a 50 µl volume reaction, with 1.25 units of AmpliTaq® DNA Polymerase (Perkin Elmer), 200 µM of dNTPs and 1 µM of each primer, or using Ready-To-Go PCR beads made by Amersham Pharmacia Biotech to which we add 1 µl per reaction of each 10 µM primer, 23 µl of water, and 2 µl of DNA.  The PCR program consists of an initial denaturing step at 94°C for 60 seconds, 35 amplification cycles (94°C for 15 sec, 49°C for 15 sec, 72°C for 15 sec), and a final step at 72°C for 6 minutes in a GeneAmp® PCR System 9700 (Perkin Elmer) or in Tetrad 4-head Thermocyclers.  Specific conditions are optimized for taxa and primer pairs.
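
For planning purposes, the programmed hold time of the PCR profile just described can be worked out directly. The sketch below is only arithmetic on the step durations given in the text; the function name is hypothetical, and ramp times between temperatures (which depend on the instrument) are deliberately left out.

```python
def program_seconds(initial, cycles, cycle_steps, final):
    """Total programmed hold time in seconds, excluding temperature ramps."""
    return initial + cycles * sum(cycle_steps) + final


# Durations taken from the profile in the text: 60 s initial denaturation,
# 35 cycles of 15 s + 15 s + 15 s, and a 6 min final extension.
total = program_seconds(initial=60, cycles=35, cycle_steps=(15, 15, 15), final=6 * 60)
print(total, "seconds, about", round(total / 60, 1), "minutes")  # 1995 seconds, about 33.2 minutes
```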

            PCR samples are purified with the Qiagen Qiaquick 96 PCR Purification Kit by eluting PCR product into 60 µl buffer EB (on the Biomek Robot using a 96 well format).  The samples are then dried for about one hour in a speed vac and resuspended in 10 µl water with the Biomek.  The isolated products are then directly sequenced using an automated ABI 3700 DNA sequencer.  Cycle-sequencing with AmpliTaq® DNA Polymerase, FS (Perkin-Elmer) using dye-labeled terminators (ABI PRISM™ BigDye™ Terminator Cycle Sequencing Ready Reaction Kit) is performed in a GeneAmp® PCR System 9700 (Perkin Elmer) and in Tetrad 4-head Thermocyclers.  Sequencing combines 3 µl water, 2 µl Big Dye, 2 µl Big Dye Extender, 1 µl 3.2 µM primer, and 2 µl DNA, 96 reactions at a time using the Biomek.  The amplification program is as follows: 96°C for 15 sec, 50°C for 15 sec x 25, 60°C for 4 min.  Sequencing reactions are then cleaned using Isopropanol/Ethanol Precipitation:  40 µl 70% isopropanol; spin 30 min at 3500 rpm; flip plate upside down and spin 1 min at 500 rpm; add 40 µl 70% ethanol and repeat spins; dry on bench 30 min; resuspend in 10 µl formamide.  The cleaned products (in microtiter plate) are then loaded directly onto the 3700, four plates at a time. Sequences are edited and contigs assembled using "SEQUENCHER" (Gene Codes Corporation).

            The combination of the Tetrad Thermocycler, Biomek robot, and ABI 3700 make it possible for one technician to amplify or sequence several hundred reactions in a day.  The 3700s have the capacity (using POP5 buffer) to sequence 8x96 (768) samples per day and the AMNH molecular lab has two of these machines.  The Biomek allows the complete automation of PCR purification and sequencing (on 96 or 384 well micro titer plates), thus saving the technician thousands of pipetting steps, improving accuracy and consistency (as well as state of mind), and freeing the researcher to perform more intellectual tasks.  This level of automation is what makes such an ambitious sequencing project possible; our lab has the capacity to perform approximately 8000 sequencing reactions per week.
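
The weekly capacity quoted above follows from the per-machine figures, and the same arithmetic gives a feel for the scale of the full project. The sketch below is illustrative only: the five-day working week, the two reads per amplicon (forward and reverse), and the absence of failed or repeated reactions are all assumptions, not figures from the proposal.

```python
SAMPLES_PER_MACHINE_DAY = 8 * 96  # 768 samples per 3700 per day (from the text)
SEQUENCERS = 2                    # two ABI 3700 machines (from the text)


def weekly_capacity(days_per_week):
    """Sequencing reactions per week across both machines."""
    return SAMPLES_PER_MACHINE_DAY * SEQUENCERS * days_per_week


def weeks_needed(taxa, loci, reads_per_amplicon, days_per_week):
    """Weeks of machine time at full capacity, ignoring failures and repeats."""
    reactions = taxa * loci * reads_per_amplicon
    return reactions / weekly_capacity(days_per_week)


print(weekly_capacity(5))                       # 7680, i.e. roughly 8000 per week
print(round(weeks_needed(500, 50, 2, 5), 1))    # 6.5 weeks for 500 taxa x 50 loci
```

On these assumptions machine time is a matter of weeks for the 500-taxon, 50-locus target, which is consistent with the statement elsewhere in this section that supply costs, rather than equipment, limit the sampling.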

            Choice of genes.  Our explorations of genes will cover many parts of the genome, but will focus on ribosomal and nuclear protein coding genes.  It is likely that some genes that pass our initial assessment of utility will be evolving too quickly to be useful at this deep phylogenetic level.  Hence, we will first sample a relatively small number of taxa (about 50) for each of 50 genes, then run separate phylogenetic analyses for each.  Those genes whose trees show considerable concordance with those of other genes will be judged as retaining sufficient historical information to be made targets for the full 500 taxon sampling. Additional "well-behaved" genes will be sought by a similar strategy until the total number of genes with apparent deep phylogenetic signal reaches at least 50.
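
One simple way to make "concordance with other genes" concrete is to compare the clades recovered by each gene tree. The sketch below is an assumption-laden illustration, not the procedure the investigators commit to: trees are written as rooted nested tuples, concordance is scored as the shared fraction of non-trivial clades, and the gene names and topologies are invented.

```python
from itertools import combinations


def clades(tree):
    """Non-trivial clades (frozensets of leaf names) of a rooted nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_taxa = leaves(tree)
    found.discard(all_taxa)  # the root clade is uninformative
    return found


def concordance(tree_a, tree_b):
    """Shared clades divided by all clades found in either tree (0 to 1)."""
    a, b = clades(tree_a), clades(tree_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_concordance(gene_trees):
    """Average pairwise concordance of each gene tree with all the others."""
    if len(gene_trees) < 2:
        raise ValueError("need at least two gene trees")
    scores = {name: [] for name in gene_trees}
    for g1, g2 in combinations(gene_trees, 2):
        c = concordance(gene_trees[g1], gene_trees[g2])
        scores[g1].append(c)
        scores[g2].append(c)
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


# Hypothetical gene trees for five taxa a-e.
gene_trees = {
    "geneA": ((("a", "b"), "c"), ("d", "e")),
    "geneB": ((("a", "b"), "c"), ("d", "e")),
    "geneC": ((("a", "d"), "c"), ("b", "e")),
}
print(mean_concordance(gene_trees))  # geneA and geneB score 0.5, geneC scores 0.0
```

In this toy case geneC would be flagged as discordant. With real data, taxon sets would differ among genes and trees would need pruning to shared taxa first, which the sketch does not attempt.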

            Archiving of Samples:  The AMNH has established a modern frozen tissue storage facility, the Ambrose Monell Collection for Molecular and Microbial Research, intended to become a core sample resource center for comparative genomics.  The facility can store one million samples from around the world, thus representing a comprehensive range of species, both pure cultured samples of taxa under study as well as taxa that cannot currently be cultured.  These samples are housed at liquid nitrogen temperatures so that the highest quality, maximum stability conditions are maintained for biomolecules indefinitely.  Several thousand spider specimens will be added to the tissue storage facility as a result of our proposed work. Ultimately, these samples might form the kernel of an international effort to store and disseminate the genomes of all described spider genera.

 

Data Analysis

            Reconstructing the phylogenetic tree for spiders will not be an easy task analytically, both because of the depth of time spanned by the tree and the size of the data set.  Because of the time depth (deepest divergences probably extending into the Paleozoic), some branches of the tree may be long and isolated, having accumulated so many differences from other taxa that relationships are obscured.  With such noise in the data set, methods are challenged to extract the correct signal.  The size of the data set, in number of characters but particularly in number of taxa, will provide perhaps the greatest computational challenge.  For example, an analysis of 500 taxa must select, implicitly or explicitly, among the 7.8 x 10^1275 possible trees.  
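
The scale of tree space can be explored with the standard formula for the number of unrooted, fully bifurcating trees on n labeled taxa, (2n-5)!!. The sketch below is illustrative; the exact count depends on the convention used (rooted or unrooted, bifurcating only or not), so its result for 500 taxa need not reproduce the figure quoted above, although it is of a similarly astronomical order.

```python
def unrooted_tree_count(n_taxa):
    """Number of unrooted bifurcating trees on n labeled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("formula applies to three or more taxa")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= odd
    return count


for n in (4, 5, 10):
    print(n, unrooted_tree_count(n))  # 3, 15 and 2027025 trees

# For large n, report only the number of decimal digits.
print(len(str(unrooted_tree_count(500))), "digits for 500 taxa")
```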

            Our analyses will therefore provide not exact but heuristic estimates, and require exploration of varied optimality criteria and creative approaches to tree searches.  Trees will be inferred under both the parsimony (Farris, 1970, 1983; Kluge, 1984) and maximum likelihood (Cavalli-Sforza and Edwards, 1967; Felsenstein, 1973, 1979, 1981a, b, 1983; Huelsenbeck and Crandall, 1997) criteria using several programs (POY, Gladstein and Wheeler, 1997; NONA and PEE-WEE, Goloboff, 1997a,b; TNT, Goloboff, Farris, and Nixon, in prep.; PAUP*, Swofford, 2002; MrBayes, Huelsenbeck, 2000; Huelsenbeck & Ronquist, 2001).  We take this broad approach to take advantage of the diverse skills of our team, and to cross-check the quality of our heuristic estimates.   Should varied approaches give substantially similar conclusions, it will suggest that those results are robust against both violation of assumptions and differing efficacies of the alternative programs.
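
To make the parsimony criterion concrete, the following minimal sketch scores a single unordered character on a fixed rooted binary tree using Fitch's set-intersection procedure, and sums the scores over a small matrix. It is a textbook illustration under simplifying assumptions (binary tree, no missing data, equal weights), not the algorithm as implemented in POY, NONA, TNT, or PAUP*; the tree and character states are invented.

```python
def fitch_length(tree, states):
    """Minimum number of changes for one unordered character (Fitch's method).

    tree   : rooted binary tree as nested 2-tuples with string leaf names
    states : dict mapping each leaf name to its character state
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # disjoint child sets imply one extra step
        return left | right

    state_set(tree)
    return changes


def matrix_length(tree, matrix):
    """Tree length: sum of Fitch lengths over all columns of a taxon -> string matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_length(tree, {taxon: row[i] for taxon, row in matrix.items()})
        for i in range(n_chars)
    )


# Hypothetical five-taxon tree and two-character matrix.
tree = ((("a", "b"), "c"), ("d", "e"))
matrix = {"a": "AT", "b": "AT", "c": "GT", "d": "GC", "e": "AC"}
print(matrix_length(tree, matrix))  # 3 steps: 2 for the first character, 1 for the second
```

A tree search repeats this scoring over many candidate topologies and keeps the shortest; it is the size of that search, not the scoring of any one tree, that makes 500-taxon analyses heuristic.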

            Available to us are programmers experienced in phylogenetic computation, as well as excellent computational facilities.  The presence of programmers on the research team -- Wheeler, with the program POY; Goloboff, with PEE-WEE, NONA, and TNT; Maddison, with MacClade (D. Maddison & W. Maddison 2000) and Mesquite (W. Maddison & D. Maddison, 2001) -- will give us an unparalleled opportunity to refine software in the context of a massive empirical project.  In addition to state-of-the-art desktop computers in many of the participating laboratories, we will have access to a large parallel cluster. In 1999, the AMNH installed a 256-processor (500 MHz Pentium III) cluster and in 2001 upgraded this to 560 processors (through the addition of 1 GHz Pentium III and 1 GB RAM per node) designed especially for phylogenetic analysis of genomic data.  This parallel cluster is the fastest installed in any evolutionary biology laboratory to date.  Its size is presently being doubled and its capacity tripled, and we anticipate upgrading it again in 2003.  Parallel applications that have been developed by us include integer-intensive DNA sequence alignment and direct optimization software (written in-house); column vector-based phylogenetic algorithms, such as pNONA and a parallel version of TNT currently under development; and simulated annealing modeling of gene circuits.  The new hardware and the 2003 upgrade, along with algorithmic improvements, should keep run times for 500 taxa within 50-100 hours.  Speed is important because we intend to perform many runs to address parameter sensitivity and to explore thoroughly the analytical space implied by the data.  The total time required for these analyses may be months on this machine -- but it would be over a century on even a current, state-of-the-art single-processor PC.

            Searches for most parsimonious trees will be undertaken with the programs POY, NONA, TNT, and PAUP*, each of which has parallel versions that are operational or under development.  Members of the team will divide use of the programs according to expertise (e.g., Wheeler and Goloboff, POY, NONA and TNT; Maddison and Hedin, PAUP*) and compare results.  One fundamental difference among these programs concerns alignment of nucleotide sequences: NONA, TNT and PAUP* expect pre-aligned sequences, while POY searches simultaneously for alignment and tree.  Simultaneous alignment and tree-search is generally regarded as ideal in principle (Wheeler, 1994, 1996, 1998, 1999, 2000, 2001; Slowinski 1998; Giribet & Wheeler, 1999; Giribet et al., 2000, 2001; Wahlberg & Zimmermann, 2000), although it carries a much higher computational cost.  For analyses by NONA/TNT/PAUP*, alignments will be provided either by POY runs or by gene-by-gene application of CLUSTAL (Higgins & Sharp, 1988, 1989; Higgins et al., 1992; Thompson et al., 1994, 1997; Jeanmougin et al., 1998), using elision techniques (Wheeler et al., 1995) to choose alignment parameters (Hedin & Maddison, 2001), supplemented by manual alignment.

            Maximum likelihood analyses will be done by both PAUP* and the newly-implemented likelihood routines in POY.  Likelihood analyses are computationally more intensive than parsimony analyses of pre-aligned data, and thus we will restrict likelihood analyses to subsets of up to 150 taxa.  Subsets will be chosen in some analyses to maximize dispersion over the expected relationships, in other analyses to focus on detailed relationships within clades that appear well-corroborated otherwise.  The resulting overlapping but partial trees will be combined by supertree methods (Gordon, 1986; Baum, 1992; Ragan, 1992a, b; Bininda-Emonds & Bryant, 1998; Steel et al., 2000; Semple & Steel, 2000; Bininda-Emonds & Sanderson, 2001) and by "eye".  In addition to conventional likelihood analyses, we will use the closely related Bayesian methods (Rannala & Yang, 1996; Mau & Newton, 1997; Mao et al., 1999) as implemented in MrBayes (Huelsenbeck, 2000; Huelsenbeck & Ronquist, 2001).  In addition to allowing us to explore an alternative criterion, use of Bayesian methods will make analysis of the entire set of taxa, by a likelihood-related method, computationally feasible.
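
Among the supertree methods cited above, matrix representation with parsimony (Baum, 1992; Ragan, 1992a, b) begins by recoding each source tree as binary characters. The sketch below shows only that coding step, under stated assumptions: source trees are rooted nested tuples, the tree and taxon names are invented, and the all-zero outgroup row and the subsequent parsimony search used in a full analysis are omitted.

```python
def leaves_and_clades(tree):
    """Return (all leaves, list of non-root clades) for a rooted nested-tuple tree."""
    collected = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        collected.append(members)
        return members

    taxa = visit(tree)
    return taxa, [clade for clade in collected if clade != taxa]


def mrp_matrix(source_trees):
    """Code each clade of each source tree as one column: 1 in, 0 out, ? absent."""
    per_tree = [leaves_and_clades(tree) for tree in source_trees]
    all_taxa = sorted(set().union(*(taxa for taxa, _ in per_tree)))
    rows = {taxon: [] for taxon in all_taxa}
    for taxa, tree_clades in per_tree:
        for clade in tree_clades:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")  # taxon not sampled in this tree
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(codes) for taxon, codes in rows.items()}


# Two hypothetical, partially overlapping source trees.
tree1 = (("a", "b"), ("c", "d"))
tree2 = (("b", "c"), "e")
for taxon, row in mrp_matrix([tree1, tree2]).items():
    print(taxon, row)
```

For these two trees the rows are a 10?, b 101, c 011, d 01?, and e ??0; the conflicting signal for b and c is then left for the parsimony analysis of the matrix to resolve.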

            Analyses of different data partitions (morphology, different genes) will be done separately and simultaneously. The simultaneous analysis approach, because it uses all available evidence, will be given more weight than any other single analysis in determining our primary estimate of the tree.  This analysis (and any other analysis involving morphological data) will be restricted to parsimony methods, because likelihood's usual assumption of uniform stochastic behavior is especially problematical for morphological data.  The data partitions will be analyzed separately for several reasons.  First, it will allow more refined gene-specific estimation of models for likelihood analysis.  Second, it will reveal to what extent different genetic regions (possibly evolving by different processes) offer independent corroboration for the same tree. Partitioned Bremer support (Baker & DeSalle 1997, Baker et al. 1998) will be used to address the relative contributions of different genes and different morphological character systems (genitalia vs. somatic) to results of the simultaneous analysis. Third, concordance among genes will indicate which are retaining the most historical information, and thus should be the target of sequencing during denser taxon sampling. The relative degree of support for nodes in all trees obtained will be assessed with branch support indices (Bremer, 1988, 1994; Donoghue et al., 1992) and bootstrap percentages (Felsenstein, 1985; Sanderson, 1989). 
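
The arithmetic behind partitioned Bremer support can be shown in a few lines. For a given node, each partition's contribution is its length on the shortest tree that lacks the node minus its length on the most parsimonious tree from the simultaneous analysis; the contributions sum to the node's total Bremer support and may be negative for partitions that conflict with the node. The partition names and tree lengths below are invented for illustration.

```python
def partitioned_bremer(lengths_on_best_tree, lengths_on_constrained_tree):
    """Per-partition and total Bremer support for one node.

    Both arguments map partition name -> number of steps for that partition,
    measured on the most parsimonious tree and on the shortest tree in which
    the node of interest is absent.
    """
    if lengths_on_best_tree.keys() != lengths_on_constrained_tree.keys():
        raise ValueError("both trees must be scored for the same partitions")
    support = {
        name: lengths_on_constrained_tree[name] - lengths_on_best_tree[name]
        for name in lengths_on_best_tree
    }
    return support, sum(support.values())


# Hypothetical step counts for three partitions at a single node.
best = {"18S": 410, "COI": 655, "morphology": 230}
without_node = {"18S": 414, "COI": 653, "morphology": 233}
print(partitioned_bremer(best, without_node))
# ({'18S': 4, 'COI': -2, 'morphology': 3}, 5)
```

In this invented case 18S and morphology favor the node while COI weakly contradicts it, which is the kind of contrast among partitions that the measure is meant to expose.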

            We noted above that likelihood analyses will require decomposing the set of taxa into smaller subsets, analyzing, then grafting the resulting trees together.  We anticipate that this approach will be useful for some of the larger parsimony analyses as well.  Similarly, as some clades emerge as well-corroborated by many characters and analyses, to simplify computation we may perform subsequent analyses by constraining their monophyly, reducing the number of their sampled representatives, or analyzing the clades (with outgroups) in isolation.  Because the large number of taxa is one of the greatest burdens on the analyses, we will also use the techniques of parametric bootstrapping to study how best to increase taxon sampling.  In development in Mesquite are modules that can simulate increased taxon sampling (randomly adding taxa to a skeletal tree, perhaps to specific regions of the tree).  Characters are simulated on the augmented tree, the tree reconstructed from them, and the ability to recover the skeletal tree is compared to simulations with the skeletal tree unaugmented.  Such studies will help guide us as to how important increased taxon sampling might be, and what regions of the tree most need it.

            These diverse analyses will produce many, sometimes conflicting results. How to reconcile them?  Points of agreement will, obviously, be considered especially well supported.  However, we expect to encounter some irreconcilable differences whose resolution will await future study. 

 

Preliminary results

            The very preliminary molecular data (Fig. 1A) show a number of classically recognized groups as well as numerous problematic groupings, at least from a morphological point of view.  This result, based solely on sparse preliminary data for a very broad range of taxa, is expected because to date we still lack about 30% of the sequences for the taxa in Fig. 1A, and because the range of time represented in Fig. 1A clearly exceeds the ability of only six genes (of which three are mitochondrial) to provide robust phylogenetic signal on deep nodes. Obviously important genes for deep nodes, such as EF1a and Pol II, are missing. In the few months available before the TOL deadline, we produced an impressive amount of preliminary molecular data: 617 sequenced fragments from six genes of 98 taxa. The choice of taxa was largely drawn from the fresh material on hand during a north temperate winter at one museum: lycosoids, symphytognathoids, and cribellate groups were undersampled, and dionychans and trochanteriids oversampled. We present results here not to argue that the choice of genes or taxa in this preliminary data set was ideal, much less that our results are definitive in any way, but simply to show that the ambitious DNA sequencing schedule is feasible. Given the full range of genes envisaged above, we anticipate that the concordance between molecules and morphology will dramatically improve. Fig. 1A recovers many accepted shallow nodes in spider phylogeny, and recovers many of the deeper nodes approximately.  For example, the orders Uropygi and Amblypygi are monophyletic, as are the families Austrochilidae, Archaeidae, Theridiidae, Synotaxidae, Agelenidae, Desidae, Deinopidae, Zodariidae, Eresidae, Liocranidae, Clubionidae, Anyphaenidae, Salticidae, and Oxyopidae.  The doublets Scytodidae-Sicariidae, Oonopidae-Orsolobidae, Diguetidae-Plectreuridae, Amphinectidae-Desidae, and Oecobiidae-Hersiliidae are recovered. 
Malkarids group with pararchaeids, which is not unreasonable. Corinnids may be polyphyletic. Thomisid, filistatid, and trochanteriid genera largely group together, suggesting perhaps problems with the individual sequences or taxa. Of course, the results are also very noisy. To name a few, liphistiids are definitely spiders, archaeids are not sister to austrochilids, linyphioids are not monophyletic, deinopids and Tengella probably do not belong in dionychans. Both Palpimanoidea and Deinopoidea are scattered throughout the tree. Deeper nodes are less convincing and the large clades often include taxa that clearly do not belong, but the molecular data do find some previously hypothesized higher groups of spiders. Araneomorphae, Haplogynae, Entelegynae, Divided Cribellum Clade, the RTA Clade, Fused Paracribellar Clade, Araneoidea, and Gnaphosoidea are substantially intact. The rooting of Araneoidea within Entelegynae, however, is upside down.

            The combined analysis (Fig. 1B) corrects some of the problems with deeper nodes (Araneae, Opisthothelae), but in fact the morphological data for this particular set of taxa are quite preliminary as well, especially as regards dionychans. Although it is being actively studied, the morphological evidence bearing on dionychan relationships has never been assessed, and no one has attempted to concatenate morphological homology hypotheses at this sampling density across all spiders. The morphology dataset used here by itself yields 1664 most parsimonious trees and contains large polytomies. Given more time to "stitch" together morphological knowledge, we are confident that comparative morphology will supply a very strong phylogenetic signal for spiders.

            POY (vers. 3.0) now supports maximum likelihood analysis with simultaneous dynamic sequence alignment on combined data of any sort (morphological, fossil, molecular, etc.). We treated the molecular data as nine complex characters (six molecules, plus three distinct 18S regions) under a seven parameter substitution model (GTR+indels) with adjustments for invariant sites and 4 class gamma rate distribution, empirically estimating five "base" frequencies (A, C, G, T, and "gap").  Each parameter was independently estimated for each of the nine molecular "characters."  Morphological data were analyzed using the model of Tuffley & Steel (1997).  The results are roughly comparable to the parsimony results, recovering, for example, Amblypygi, Pedipalpi, Austrochilidae, Deinopidae, Archaeidae, Theridiidae, Synotaxidae, Agelenidae, Desidae, Eresidae, Liocranidae, Clubionidae, Anyphaenidae, Salticidae, Oxyopidae, Scytodidae-Sicariidae, Oonopidae-Orsolobidae, and Amphinectidae-Desidae, but losing, for example, zodariid monophyly. Problems are still evident at deeper nodes. We believe these problems are data-dependent and will be evanescent, but wish to emphasize that our computational capabilities now include various implementations of maximum likelihood and parsimony, whether of partitioned or simultaneously analyzed data, and constitute uniquely powerful tools for exploratory data analysis. 
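
The idea of empirically estimating frequencies over a five-symbol alphabet (A, C, G, T, and "gap") can be illustrated with a simple count. The sketch below assumes already-gapped sequences purely for illustration; under dynamic alignment, as in POY, gaps are inferred during the analysis, so this is not a description of how POY itself obtains these values. The function name and toy sequences are hypothetical.

```python
from collections import Counter

SYMBOLS = "ACGT-"  # four nucleotides plus the gap symbol


def empirical_frequencies(sequences):
    """Relative frequency of each of the five symbols across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in SYMBOLS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no countable symbols found")
    return {symbol: counts[symbol] / total for symbol in SYMBOLS}


# Hypothetical gapped sequences for one molecular "character".
print(empirical_frequencies(["ACGT-", "AAGT-"]))
# A 0.3, C 0.1, G 0.2, T 0.2, gap 0.2
```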

 

Broad Impacts

            Mega-diverse groups like spiders are a major element of our planet's biocomplexity, performing crucial roles in the ecological processes that support human life.  Because of the historical emphasis on larger, more conspicuous organisms, groups like spiders have been comparatively neglected, and a full appreciation of their role in the evolutionary history of life on earth has been impossible to achieve.  By combining a massive comparative genomic survey of spiders with an equally thorough survey of new and existing morphological and behavioral data, we will be able to elucidate the history of a major chunk of the tree of life, on a global scale. 

            Through this project we will help train at least three postdoctoral fellows and at least three graduate students in all aspects of this interdisciplinary effort, from morphological and molecular data collection to the details of modern computational techniques for phylogenetic analysis.  New and encyclopedic archives of all the available comparative data on a wide range of taxa, and the results of our analyses of those data, will be made available electronically to colleagues everywhere through the World Wide Web.  

            Training and Education.  This research proposal carries a strong training and educational commitment covering a wide range of educational levels, from high school students to postdoctoral trainees.  Four postdoctoral trainees will be trained over the course of the grant.  We favor postdoctoral tenures of two years.  These trainees will have their "home base" at the AMNH, CAS, GWU and the Smithsonian.  If this project were to be funded, CAS would provide matching funds to hire a postdoc for two years to work on this research (see letter from Griswold in Supplementary Documentation).  We will implement a system of laboratory rotation that will allow postdocs to work on project areas complementary to their primary research responsibilities.  For example, the postdoc at GWU will be responsible primarily for collecting the morphological data of a particular clade (or group of clades).  During his/her two-year tenure at GWU this trainee will carry out a smaller project at a molecular lab (e.g., the AMNH's) so that he/she can complement his/her expertise by becoming familiar with the collection and analysis of sequence data.  This hands-on approach will help trainees to become familiar with the diversity of data and analytical approaches involved in the project as a whole.  

            Several graduate students will be actively involved in this research project.  The following project participants are based at academic institutions that offer graduate programs in systematics (e.g., www.gwu.edu/~clade) and can act as primary graduate advisors: Arnedo (U. Barcelona), Bond (ECU), Hedin (SDSU), Hormiga (GWU), Maddison (U. Arizona), and Scharff (U. Copenhagen).  All the remaining PIs and collaborators have formal ties with academic institutions, such as adjunct professorships at local universities, and can co-direct theses and dissertations (see Biographical Sketches). The diversity of participating institutions (universities, museums, research institutes, etc.) and approaches (morphologists, molecular systematists, theoreticians, programmers, etc.) will provide a fertile milieu for research exchanges that will be particularly beneficial for students.  In total our group has more than 106 years of experience as biology educators and has been involved, in cumulative terms, in the training of more than 56 graduate students and 19 postdoctoral associates.  Given the magnitude and scope of the TOL project we favor training doctoral students over M.S. students, but are open to the latter.  As stated for the postdoctoral trainees, we will implement a system of lab rotation for graduate students to ensure training in all aspects of systematics.  In addition, we will try to integrate existing (or future) graduate students in our labs within the TOL project, by complementing the scope of their doctoral projects.  For example, a graduate student working on species-level systematics for his/her dissertation could contribute to the TOL project by placing the genus in the higher-level cladistic context by using and contributing to TOL data.  Such an approach would be mutual