We revisit statistical tests for branches of evolutionary trees reconstructed upon molecular data. A new, fast, approximate likelihood-ratio test (aLRT) for branches is presented here as a ...competitive alternative to nonparametric bootstrap and Bayesian estimation of branch support. The aLRT is based on the idea of the conventional LRT, with the null hypothesis corresponding to the assumption that the inferred branch has length 0. We show that the LRT statistic is asymptotically distributed as a maximum of three random variables drawn from the distribution. The new aLRT of interior branch uses this distribution for significance testing, but the test statistic is approximated in a slightly conservative but practical way as 2(ℓ1− ℓ2), i.e., double the difference between the maximum log-likelihood values corresponding to the best tree and the second best topological arrangement around the branch of interest. Such a test is fast because the log-likelihood value ℓ2 is computed by optimizing only over the branch of interest and the four adjacent branches, whereas other parameters are fixed at their optimal values corresponding to the best ML tree. The performance of the new test was studied on simulated 4-, 12-, and 100-taxon data sets with sequences of different lengths. The aLRT is shown to be accurate, powerful, and robust to certain violations of model assumptions. The aLRT is implemented within the algorithm used by the recent fast maximum likelihood tree estimation program PHYML (Guindon and Gascuel, 2003).
Celotno besedilo
Dostopno za:
BFBNIB, DOBA, IZUM, KILJ, NMLJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Phylogenetic inference and evaluating support for inferred relationships is at the core of many studies testing evolutionary hypotheses. Despite the popularity of nonparametric bootstrap frequencies ...and Bayesian posterior probabilities, the interpretation of these measures of tree branch support remains a source of discussion. Furthermore, both methods are computationally expensive and become prohibitive for large data sets. Recent fast approximate likelihood-based measures of branch supports (approximate likelihood ratio test aLRT and Shimodaira-Hasegawa SH-aLRT) provide a compelling alternative to these slower conventional methods, offering not only speed advantages but also excellent levels of accuracy and power. Here we propose an additional method: a. Bayesian-like transformation of aLRT (aBayes). Considering both probabilistic and frequentisi frameworks, we compare the performance of the three fast likelihood-based methods with the standard bootstrap (SBS), the Bayesian approach, and the recently introduced rapid bootstrap. Our simulations and real data analyses show that with moderate model violations, all tests are sufficiently accurate, but aLRT and aBayes offer the highest statistical power and are very fast. With severe model violations aLRT, aBayes and Bayesian posteriors can produce elevated false-positive rates. With data sets for which such violation can be detected, we recommend using SH-aLRT, the nonparametric version of aLRT based on a procedure similar to the Shimodaira-Hasegawa tree selection. In general, the SBS seems to be excessively conservative and is much slower than our approximate likelihood-based methods.
Celotno besedilo
Dostopno za:
BFBNIB, DOBA, IZUM, KILJ, NMLJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Sequence tandem repeats (TRs) are abundant in proteomes across all domains of life. For plants, little is known about their distribution or contribution to protein function. We exhaustively annotated ...TRs and studied the evolution of TR unit variations for all Ensembl plants. Using phylogenetic patterns of TR units, we detected conserved TRs with unit number and order preserved during evolution, and those TRs that have diverged via recent TR unit gains/losses. We correlated the mode of evolution of TRs to protein function. TR number was strongly correlated with proteome size, with about one‐half of all TRs recognized as common protein domains. The majority of TRs have been highly conserved over long evolutionary distances, some since the separation of red algae and green plants c. 1.6 billion yr ago. Conversely, recurrent recent TR unit mutations were rare. Our results suggest that the first TRs by far predate the first plants, and that TR appearance is an ongoing process with similar rates across the plant kingdom. Interestingly, the few detected highly mutable TRs might provide a source of variation for rapid adaptation. In particular, such TRs are enriched in leucine‐rich repeats (LRRs) commonly found in R genes, where TR unit gain/loss may facilitate resistance to emerging pathogens.
PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree ...topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira–Hasegawa–like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the last version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from http://www.atgc-montpellier.fr/phyml/.
Celotno besedilo
Dostopno za:
BFBNIB, DOBA, IZUM, KILJ, NMLJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Abstract
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated ...annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
Sequence alignment is crucial in genomics studies. However, optimal multiple sequence alignment (MSA) is NP-hard. Thus, modern MSA methods employ progressive heuristics, breaking the problem into a ...series of pairwise alignments guided by a phylogeny. Changes between homologous characters are typically modelled by a Markov substitution model. In contrast, the dynamics of indels are not modelled explicitly, because the computation of the marginal likelihood under such models has exponential time complexity in the number of taxa. But the failure to model indel evolution may lead to artificially short alignments due to biased indel placement, inconsistent with phylogenetic relationship.
Recently, the classical indel model TKF91 was modified to describe indel evolution on a phylogeny via a Poisson process, termed PIP. PIP allows to compute the joint marginal probability of an MSA and a tree in linear time. We present a new dynamic programming algorithm to align two MSAs -represented by the underlying homology paths- by full maximum likelihood under PIP in polynomial time, and apply it progressively along a guide tree. We have corroborated the correctness of our method by simulation, and compared it with competitive methods on an illustrative real dataset.
Our MSA method is the first polynomial time progressive aligner with a rigorous mathematical formulation of indel evolution. The new method infers phylogenetically meaningful gap patterns alternative to the popular PRANK, while producing alignments of similar length. Moreover, the inferred gap patterns agree with what was predicted qualitatively by previous studies. The algorithm is implemented in a standalone C++ program: https://github.com/acg-team/ProPIP . Supplementary data are available at BMC Bioinformatics online.
Celotno besedilo
Dostopno za:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Six extant species of non-human great apes are currently recognized: Sumatran and Bornean orangutans, eastern and western gorillas, and chimpanzees and bonobos 1. However, large gaps remain in our ...knowledge of fine-scale variation in hominoid morphology, behavior, and genetics, and aspects of great ape taxonomy remain in flux. This is particularly true for orangutans (genus: Pongo), the only Asian great apes and phylogenetically our most distant relatives among extant hominids 1. Designation of Bornean and Sumatran orangutans, P. pygmaeus (Linnaeus 1760) and P. abelii (Lesson 1827), as distinct species occurred in 2001 1, 2. Here, we show that an isolated population from Batang Toru, at the southernmost range limit of extant Sumatran orangutans south of Lake Toba, is distinct from other northern Sumatran and Bornean populations. By comparing cranio-mandibular and dental characters of an orangutan killed in a human-animal conflict to those of 33 adult male orangutans of a similar developmental stage, we found consistent differences between the Batang Toru individual and other extant Ponginae. Our analyses of 37 orangutan genomes provided a second line of evidence. Model-based approaches revealed that the deepest split in the evolutionary history of extant orangutans occurred ∼3.38 mya between the Batang Toru population and those to the north of Lake Toba, whereas both currently recognized species separated much later, about 674 kya. Our combined analyses support a new classification of orangutans into three extant species. The new species, Pongo tapanuliensis, encompasses the Batang Toru population, of which fewer than 800 individuals survive.
Display omitted
•We describe a new species of great apes, the Tapanuli orangutan Pongo tapanuliensis•Genomic analyses corroborate morphological distinctiveness of P. tapanuliensis•P. tapanuliensis comprises the oldest evolutionary lineage in the genus Pongo•With fewer than 800 individuals, P. tapanuliensis is among the most endangered great apes
Nater et al. describe a new great ape species, the Tapanuli orangutan Pongo tapanuliensis. An isolated population from Batang Toru is highly distinct from the northern Sumatran and Bornean species, based on morphological variation, corroborated by population genomic analyses. Fewer than 800 individuals of P. tapanuliensis survive in the wild.
Ralstonia solanacearum is a soil-borne beta-proteobacterium that causes bacterial wilt disease in many food crops and is a major problem for agriculture in intertropical regions. R. solanacearum is a ...heterogeneous species, both phenotypically and genetically, and is considered as a species complex. Pathogenicity of R. solanacearum relies on the Type III secretion system that injects Type III effector (T3E) proteins into plant cells. T3E collectively perturb host cell processes and modulate plant immunity to enable bacterial infection.
We provide the catalogue of T3E in the R. solanacearum species complex, as well as candidates in newly sequenced strains. 94 T3E orthologous groups were defined on phylogenetic bases and ordered using a uniform nomenclature. This curated T3E catalog is available on a public website and a bioinformatic pipeline has been designed to rapidly predict T3E genes in newly sequenced strains. Systematical analyses were performed to detect lateral T3E gene transfer events and identify T3E genes under positive selection. Our analyses also pinpoint the RipF translocon proteins as major discriminating determinants among the phylogenetic lineages.
Establishment of T3E repertoires in strains representatives of the R. solanacearum biodiversity allowed determining a set of 22 T3E present in all the strains but provided no clues on host specificity determinants. The definition of a standardized nomenclature and the optimization of predictive tools will pave the way to understanding how variation of these repertoires is correlated to the diversification of this species complex and how they contribute to the different strain pathotypes.
Celotno besedilo
Dostopno za:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Abstract
Background
Current alignment tools typically lack an explicit model of indel evolution, leading to artificially short inferred alignments (i.e., over-alignment) due to inconsistencies ...between the indel history and the phylogeny relating the input sequences.
Results
We present a new progressive multiple sequence alignment tool ProPIP. The process of insertions and deletions is described using an explicit evolutionary model—the Poisson Indel Process or PIP. The method is based on dynamic programming and is implemented in a frequentist framework. The source code can be compiled on Linux, macOS and Microsoft Windows platforms. The algorithm is implemented in C++ as standalone program. The source code is freely available on GitHub at
https://github.com/acg-team/ProPIP
and is distributed under the terms of the GNU GPL v3 license.
Conclusions
The use of an explicit indel evolution model allows to avoid over-alignment, to infer gaps in a phylogenetically consistent way and to make inferences about the rates of insertions and deletions. Instead of the arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have biological interpretation and are contextualized in a probabilistic environment. As a result, indel rate settings may be optimised in order to infer phylogenetically meaningful gap patterns.
Celotno besedilo
Dostopno za:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
In protein-coding genes, synonymous mutations are often thought not to affect fitness and therefore are not subject to natural selection. Yet increasingly, cases of non-neutral evolution at certain ...synonymous sites were reported over the last decade. To evaluate the extent and the nature of site-specific selection on synonymous codons, we computed the site-to-site synonymous rate variation (SRV) and identified gene properties that make SRV more likely in a large database of protein-coding gene families and protein domains. To our knowledge, this is the first study that explores the determinants and patterns of the SRV in real data. We show that the SRV is widespread in the evolution of protein-coding sequences, putting in doubt the validity of the synonymous rate as a standard neutral proxy. While protein domains rarely undergo adaptive evolution, the SRV appears to play important role in optimizing the domain function at the level of DNA. In contrast, protein families are more likely to evolve by positive selection, but are less likely to exhibit SRV. Stronger SRV was detected in genes with stronger codon bias and tRNA reusage, those coding for proteins with larger number of interactions or forming larger number of structures, located in intracellular components and those involved in typically conserved complex processes and functions. Genes with extreme SRV show higher expression levels in nearly all tissues. This indicates that codon bias in a gene, which often correlates with gene expression, may often be a site-specific phenomenon regulating the speed of translation along the sequence, consistent with the co-translational folding hypothesis. Strikingly, genes with SRV were strongly overrepresented for metabolic pathways and those associated with several genetic diseases, particularly cancers and diabetes.
Celotno besedilo
Dostopno za:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK