• Up to 86% of internal gene-tree branches are dubiously or arbitrarily resolved.
• Collapsing branches increased species-tree coalescent branch lengths by up to 455%.
• Severe and clearly justified collapsing method for likelihood: 0% SH-like aLRT.
• Severe and clearly justified collapsing method for parsimony: strict consensus.
• Collapsing can improve congruence between coalescent and concatenation analyses.
In two-step coalescent analyses of phylogenomic data, gene-tree topologies are treated as fixed prior to species-tree inference. Although all gene-tree conflict is assumed to be caused by lineage sorting when applying these methods, in empirical datasets much of the conflict can be caused by estimation error. Weakly supported and even arbitrarily resolved clades are important sources of this estimation error for gene trees inferred from few informative characters relative to the number of sampled terminals, and the resulting extraneous conflict among gene trees can negatively impact species-tree inference. In this study, we quantified the relative severity of alternative methods for collapsing gene-tree branches for seven empirical datasets and quantified their effects on species-tree inference. The branch-collapsing methods that we employed were based on the strict consensus of optimal topologies, various bootstrap thresholds, and 0% approximate likelihood ratio test (SH-like aLRT) support. Up to 86% of internal gene-tree branches are dubiously or arbitrarily resolved in reanalyses of these published phylogenomic datasets, and collapsing these branches increased inferred species-tree coalescent branch lengths by up to 455%. For two datasets, the longer inferred branch lengths sometimes impacted inference of anomaly-zone conditions. Although branch-collapsing methods did not consistently affect the species-tree topology, they often increased branch support. The more severe and clearly justified gene-tree branch-collapsing methods, which we recommend be broadly applied for two-step coalescent analyses, are the use of the strict consensus in parsimony analyses and the collapse of clades with 0% SH-like aLRT support in likelihood analyses. Collapsing dubiously or arbitrarily resolved branches in gene trees sometimes improved congruence between coalescent-based results and concatenation trees. In such cases, we contend that the resolution provided by concatenation should be preferred and that incomplete lineage sorting is a poor explanation for the initial conflict between phylogenetic approaches.
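As a rough illustration of what collapsing gene-tree branches at a support threshold involves, the following minimal sketch removes internal branches whose support is at or below a cutoff, turning the affected nodes into polytomies. It is not the pipeline used in the study: the `Node` class, the `collapse_low_support` and `to_newick` helpers, and the example tree with its support values are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; `support` belongs to the branch above the node."""
    name: str = ""
    support: Optional[float] = None
    children: List["Node"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children


def collapse_low_support(node: Node, threshold: float = 0.0) -> Node:
    """Return a copy of the tree in which every internal, non-root branch
    with support <= threshold is collapsed into a polytomy."""
    new_children: List[Node] = []
    for child in node.children:
        child = collapse_low_support(child, threshold)
        weak = (not child.is_leaf
                and child.support is not None
                and child.support <= threshold)
        if weak:
            # Remove the weak branch: attach its descendants to the parent.
            new_children.extend(child.children)
        else:
            new_children.append(child)
    return Node(node.name, node.support, new_children)


def to_newick(node: Node) -> str:
    """Serialize the tree as Newick (without the trailing semicolon)."""
    if node.is_leaf:
        return node.name
    inner = ",".join(to_newick(c) for c in node.children)
    label = "" if node.support is None else f"{node.support:g}"
    return f"({inner}){label}"


# Made-up gene tree: ((A,B)0,(C,(D,E)95)80);
tree = Node(children=[
    Node(support=0.0, children=[Node("A"), Node("B")]),
    Node(support=80.0, children=[
        Node("C"),
        Node(support=95.0, children=[Node("D"), Node("E")]),
    ]),
])

zero_collapsed = to_newick(collapse_low_support(tree, threshold=0.0)) + ";"
strict_collapsed = to_newick(collapse_low_support(tree, threshold=80.0)) + ";"
print(zero_collapsed)
print(strict_collapsed)
```

With a threshold of 0, only the (A,B) branch is removed, which mirrors the idea of collapsing clades with 0% support; a higher threshold (here 80) corresponds to a more severe bootstrap-style cutoff and also removes the branch leading to (C,(D,E)).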
• (Amborella, Nuphar) resolution by coalescence methods is an artifact of mis-rooting.
• Amborella alone is supported as sister to the remaining extant angiosperms.
• ASTRAL is more robust to incorrectly rooted gene trees than MP-EST or STAR.
• OV and TIGER biased in favor of characters with asymmetrical state distributions.
• Novel methods may be novel sources of systematic errors.
It has recently been concluded that phylogenomic data from 310 nuclear genes support the clade of (Amborellales, Nymphaeales) as sister to the remaining angiosperms and that shortcut coalescent phylogenetic methods outperformed concatenation for these data. We falsify both of those conclusions here by demonstrating that discrepant results between the coalescent and concatenation analyses are primarily caused by the coalescent methods applied (MP-EST and STAR) not being robust to the highly divergent and often mis-rooted gene trees that were used. This result reinforces the expectation that low amounts of phylogenetic signal and methodological artifacts in gene-tree reconstruction can be more problematic for shortcut coalescent methods than is the assumption of a single hierarchy for all genes by concatenation methods when these approaches are applied to ancient divergences in empirical studies. We also demonstrate that a third coalescent method, ASTRAL, is more robust to mis-rooted gene trees than MP-EST or STAR, and that both Observed Variability (OV) and Tree Independent Generation of Evolutionary Rates (TIGER), which are two character subsampling procedures, are biased in favor of characters with highly asymmetrical distributions of character states when applied to this dataset. We conclude that enthusiastic application of novel tools is not a substitute for rigorous application of first principles, and that trending methods (e.g., shortcut coalescent methods applied to ancient divergences, tree-independent character subsampling), may be novel sources of previously under-appreciated, systematic errors.
• Bias by likelihood and Bayesian methods in favor of characters without missing data.
• Bias applies to optimal tree, bootstrap, SH-like aLRT, and posterior probabilities.
• Bias can occur when missing or inapplicable data are in a single terminal.
• Bias persists despite sampling numerous characters.
• Parsimony is robust to the bias.
Contrived and simulated examples were used to quantify the range of conditions in which maximum likelihood and Bayesian MCMC methods are biased in favor of phylogenetic signal present in globally sampled characters over that present in conflicting locally sampled characters (those with missing data). The bias occurs in both the optimal tree identified as well as branch supports even when there are more locally sampled characters supporting the conflicting topology. The bias can lead to high bootstrap, SH-like aLRT support (up to 100%), and posterior probabilities for the conflicting clades. The bias can occur even when only a single terminal has missing data. The bias is not limited to likelihood methods that only ever present a single optimal tree that is fully resolved (as in PhyML and RAxML)—it can also occur in branch-and-bound PAUP∗ searches. The bias persists despite sampling numerous characters, and the bias is consistently unidirectional. The bias may occur in the context of incongruence between gene trees as well as within a single gene wherein terminals have different sequence lengths caused by DNA-amplification differences or gaps caused by indels. This bias is another example wherein commonly implemented parametric phylogenetic methods interpret ambiguity as support. In contrast, parsimony is robust to the bias.
• Photoautotrophic growth in nature requires the accumulation of energy-containing molecules via photosynthesis during daylight to fuel nighttime catabolism. Many diatoms store photosynthate as the neutral lipid triacylglycerol (TAG). While the pathways of diatom fatty acid and TAG synthesis appear to be well conserved with plants, the pathways of TAG catabolism and downstream fatty acid β-oxidation have not been characterised in diatoms.
• We identified a putative mitochondria-targeted, bacterial-type acyl-CoA dehydrogenase (PtMACAD1) that is present in Stramenopile and Hacrobian eukaryotes, but not found in plants, animals or fungi. Gene knockout, protein-YFP tags and physiological assays were used to determine PtMACAD1’s role in the diatom Phaeodactylum tricornutum.
• PtMACAD1 is located in the mitochondria. Absence of PtMACAD1 led to no consumption of TAG at night and slower growth in light : dark cycles compared with wild-type. Accumulation of transcripts encoding peroxisomal-based β-oxidation did not change in response to day : night cycles or to PtMACAD1 knockout. Mutants also hyperaccumulated TAG after the amelioration of N limitation.
• We conclude that diatoms utilise mitochondrial β-oxidation; this is in stark contrast to the peroxisomal-based pathways observed in plants and green algae. We infer that this pattern is caused by retention of catabolic pathways from the host during plastid secondary endosymbiosis.
► Non-random missing data, without rate heterogeneity, can cause misleading results.
► Non-random missing data, without informative characters, can cause misleading results.
► Artifacts were found to occur frequently using 22 empirical examples.
► Partitioning based on missing data helps, but does not eliminate, artifacts.
► Artifacts are exacerbated by low quality tree searches.
Non-random distributions of missing data are a general problem for likelihood-based statistical analyses, including those in a phylogenetic context. Extensive non-randomly distributed missing data are particularly problematic in supermatrix analyses that include many terminals and/or loci. It has been widely reported that missing data can lead to loss of resolution, but only very rarely create misleading or otherwise unsupported results in a parsimony context. Yet this does not hold for all parametric-based analyses because of their assumption of homogeneity across characters and lineages, which can lead to both long-branch attraction and long-branch repulsion. Contrived examples were used to demonstrate that non-random distributions of missing data, even without rate heterogeneity among characters and a well fitting model, can provide misleading likelihood-based topologies and branch-support values that are radically unstable based on slight modifications to character sampling. The same can occur despite complete absence of parsimony-informative characters. Otherwise unsupported resolution and high branch support for these clades were found to occur frequently in 22 empirical examples derived from a published supermatrix. Partitioning characters based on the distribution of missing data helped to decrease, but did not eliminate, these artifacts. These artifacts were exacerbated by low quality tree searches, particularly when holding only a single optimal tree that must be fully resolved.
Species richness is greatest in the tropics, and much of this diversity is concentrated in mountains. Janzen proposed that reduced seasonal temperature variation selects for narrower thermal tolerances and limited dispersal along tropical elevation gradients [Janzen DH (1967) Am Nat 101:233–249]. These locally adapted traits should, in turn, promote reproductive isolation and higher speciation rates in tropical mountains compared with temperate ones. Here, we show that tropical and temperate montane stream insects have diverged in thermal tolerance and dispersal capacity, two key traits that are drivers of isolation in montane populations. Tropical species in each of three insect clades have markedly narrower thermal tolerances and lower dispersal than temperate species, resulting in significantly greater population divergence, higher cryptic diversity, higher tropical speciation rates, and greater accumulation of species over time. Our study also indicates that tropical montane species, with narrower thermal tolerance and reduced dispersal ability, will be especially vulnerable to rapid climate change.
Contemporary phylogenomic studies frequently incorporate two‐step coalescent analyses wherein the first step is to infer individual‐gene trees, generally using maximum‐likelihood implemented in the popular programs PhyML or RAxML. Four concerns with this approach are that these programs only present a single fully resolved gene tree to the user despite potential for ambiguous support, insufficient phylogenetic signal to fully resolve each gene tree, inexact computer arithmetic affecting the reported likelihood of gene trees, and an exclusive focus on the most likely tree while ignoring trees that are only slightly suboptimal or within the error tolerance. Taken together, these four concerns are sufficient for RAxML and PhyML users to be suspicious of the resulting (perhaps over‐resolved) gene‐tree topologies and (perhaps unjustifiably high) bootstrap support for individual clades. In this study, we sought to determine how frequently these concerns apply in practice to contemporary phylogenomic studies that use RAxML for gene‐tree inference. We did so by re‐analyzing 100 genes from each of ten studies that, taken together, are representative of many empirical phylogenomic studies. Our seven findings are as follows. First, the few search replicates that are frequently applied in phylogenomic studies are generally insufficient to find the optimal gene‐tree topology. Second, there is often more topological variation among slightly suboptimal gene trees relative to the best‐reported tree than can be safely ignored. Third, the Shimodaira–Hasegawa‐like approximate likelihood ratio test is highly effective at identifying dubiously supported clades and outperforms the alternative approaches of relying on bootstrap support or collapsing minimum‐length branches. Fourth, the bootstrap can, but rarely does, indicate high support for clades that are not supported amongst slightly suboptimal trees. Fifth, increasing the accuracy by which RAxML optimizes model‐parameter values generally has a nominal effect on selection of optimal trees. Sixth, tree searches using the GTRCAT model were generally less effective at finding optimal known trees than those using the GTRGAMMA model. Seventh, choice of gene‐tree sampling strategy can affect inferred coalescent branch lengths, species‐tree topology and branch support.
• Summary coalescent methods are not robust to gene-tree misrooting errors.
• Summary coalescent methods are not robust to homology errors.
• Summary coalescent methods are not robust to differential sampling of taxa.
• Additional conflicting retroelement insertions are revealed.
• 20,850 loci and 4345 retroelements do not robustly resolve palaeognath phylogeny.
Phylogenomic analyses of ancient rapid radiations can produce conflicting results that are driven by differential sampling of taxa and characters as well as the limitations of alternative analytical methods. We re-examine basal relationships of palaeognath birds (ratites and tinamous) using recently published datasets of nucleotide characters from 20,850 loci as well as 4301 retroelement insertions. The original studies attributed conflicting resolutions of rheas in their inferred coalescent and concatenation trees to concatenation failing in the anomaly zone. By contrast, we find that the coalescent-based resolution of rheas is premised upon extensive gene-tree estimation errors. Furthermore, retroelement insertions contain much more conflict than originally reported and multiple insertion loci support the basal position of rheas found in concatenation trees, while none were reported in the original publication. We demonstrate how even remarkable congruence in phylogenomic studies may be driven by long-branch misplacement of a divergent outgroup, highly incongruent gene trees, differential taxon sampling that can result in gene-tree misrooting errors that bias species-tree inference, and gross homology errors. What was previously interpreted as broad, robustly supported corroboration for a single resolution in coalescent analyses may instead indicate a common bias that taints phylogenomic results across multiple genome-scale datasets. The updated retroelement dataset now supports a species tree with branch lengths that suggest an ancient anomaly zone, and both concatenation and coalescent analyses of the huge nucleotide datasets fail to yield coherent, reliable results in this challenging phylogenetic context.
I assert that similarity is the appropriate homology criterion for sequence alignment, as it is with morphology. Methods that select among alignments using parsimony-based tree lengths, as implemented in MALIGN and POY, arrange the data such that they are consistent with a minimum-evolution model. When combining data sets in phylogenetic analyses, we are not trying to reinforce our earlier hypotheses about relationships, but rather to test them. The severity of this test is compromised when congruence with other characters is favored when selecting among alignment parameters.