Perfect phylogenies are fundamental in the study of evolutionary trees because they capture the situation when each evolutionary trait emerges only once in history; if such events are believed to be ...rare, then by Occam’s Razor such parsimonious trees are preferable as a hypothesis of evolution. A classical result states that 2-state characters permit a perfect phylogeny precisely if each subset of 2 characters permits one. More recently, it was shown that for 3-state characters the same property holds but for size-3 subsets. A long-standing open problem asked whether such a constant exists for each number of states. More precisely, it has been conjectured that for any fixed number of states r there exists a constant f(r) such that a set of r-state characters C has a perfect phylogeny if and only if every subset of at most f(r) characters has a perfect phylogeny. Informally, the conjecture states that checking fixed-size subsets of characters is enough to correctly determine whether input data permits a perfect phylogeny, irrespective of the number of characters in the input. In this article, we show that this conjecture is false. In particular, we show that for any constant t, there exists a set C of 8-state characters such that C has no perfect phylogeny, but there exists a perfect phylogeny for every subset of at most t characters. Moreover, there already exists a perfect phylogeny when ignoring just one of the characters, independent of which character you ignore. This negative result complements the two negative results (“strikes”) of Bodlaender et al. (1992, 2000). We reflect on the consequences of this third strike, pointing out that while it does close off some routes for efficient algorithm development, many others remain open. Four gamete condition; local obstructions conjecture; maximum parsimony; perfect phylogeny; phylogenetic tree.
The perfect phylogeny is a widely used model in phylogenetics, since it provides an effective representation of evolution of binary characters in several contexts, such as for example in haplotype ...inference. The model, which is conceptually the simplest among those actually used, is based on the infinite sites assumption, that is no character can mutate more than once in the whole tree. Since a large number of biological phenomena cannot be modeled by the perfect phylogeny, it becomes important to find generalizations that retain the computational tractability of the original model, but are more flexible in modeling biological data when the infinite site assumption is violated, e.g. because of back mutations. In this paper, we introduce a new model—called species-driven persistent phylogeny—and we study the relations between three different formulations: perfect phylogeny, persistent phylogeny, galled trees, and species-driven persistent phylogeny. The species-driven persistent phylogeny model is intermediate between the perfect and the persistent phylogeny, since a perfect phylogeny allows no back mutations and a persistent phylogeny allows each character to back mutate only once. We describe an algorithm to compute a species-driven persistent phylogeny and we prove that every matrix admitting a galled-tree also admits a species-driven persistent phylogeny.
A perfect phylogeny is a rooted binary tree that recursively partitions sequences. The nested partitions of a perfect phylogeny provide insight into the pattern of ancestry of genetic sequence data. ...For example, sequences may cluster together in a partition indicating that they arise from a common ancestral haplotype.
We present an R package perfectphyloR to reconstruct the local perfect phylogenies underlying a sample of binary sequences. The package enables users to associate the reconstructed partitions with a user-defined partition. We describe and demonstrate the major functionality of the package.
The perfectphyloR package should be of use to researchers seeking insight into the ancestral structure of their sequence data. The reconstructed partitions have many applications, including the mapping of trait-influencing variants.
A main open question related to character-based tree reconstruction is designing generalizations of the Perfect Phylogeny approach that couple efficient algorithmic solutions to the capability of ...explaining the input binary data, by allowing back mutations of some characters. Following this goal, the Persistent Phylogeny model and the related tree reconstruction problem (the PPP problem) have been recently introduced: this model allows only one back mutation for each character. The investigation of the combinatorial properties and the complexity of the model is still open: the most important such question is whether the PPP problem is NP-hard. Here we propose a graph-based approach to the PPP problem by showing that instances can be represented by colored graphs, while the solutions are obtained by operations on such graphs. Indeed, we give a graph-based characterization of the solutions to the PPP problem by showing the relationship between certain sequences of graph operations on the instance graphs and traversals of a persistent phylogeny solving these instances. Based on this result and on some combinatorial properties of the instance graphs we are able to give a polynomial time algorithm for a restricted version of the PPP problem.
Parsimonious Clone Tree Integration in cancer Sashittal, Palash; Zaccaria, Simone; El-Kebir, Mohammed
Algorithms for molecular biology,
03/2022, Letnik:
17, Številka:
1
Journal Article
Recenzirano
Odprti dostop
Every tumor is composed of heterogeneous clones, each corresponding to a distinct subpopulation of cells that accumulated different types of somatic mutations, ranging from single-nucleotide variants ...(SNVs) to copy-number aberrations (CNAs). As the analysis of this intra-tumor heterogeneity has important clinical applications, several computational methods have been introduced to identify clones from DNA sequencing data. However, due to technological and methodological limitations, current analyses are restricted to identifying tumor clones only based on either SNVs or CNAs, preventing a comprehensive characterization of a tumor's clonal composition.
To overcome these challenges, we formulate the identification of clones in terms of both SNVs and CNAs as a integration problem while accounting for uncertainty in the input SNV and CNA proportions. We thus characterize the computational complexity of this problem and we introduce PACTION (PArsimonious Clone Tree integratION), an algorithm that solves the problem using a mixed integer linear programming formulation. On simulated data, we show that tumor clones can be identified reliably, especially when further taking into account the ancestral relationships that can be inferred from the input SNVs and CNAs. On 49 tumor samples from 10 prostate cancer patients, our integration approach provides a higher resolution view of tumor evolution than previous studies.
PACTION is an accurate and fast method that reconstructs clonal architecture of cancer tumors by integrating SNV and CNA clones inferred using existing methods.
Abstract
Background
Cancer arises from an evolutionary process where somatic mutations give rise to clonal expansions. Reconstructing this evolutionary process is useful for treatment decision-making ...as well as understanding evolutionary patterns across patients and cancer types. In particular, classifying a tumor’s evolutionary process as either linear or branched and understanding what cancer types and which patients have each of these trajectories could provide useful insights for both clinicians and researchers. While comprehensive cancer phylogeny inference from single-cell DNA sequencing data is challenging due to limitations with current sequencing technology and the complexity of the resulting problem, current data might provide sufficient signal to accurately classify a tumor’s evolutionary history as either linear or branched.
Results
We introduce the Linear Perfect Phylogeny Flipping (LPPF) problem as a means of testing two alternative hypotheses for the pattern of evolution, which we prove to be NP-hard. We develop Phyolin, which uses constraint programming to solve the LPPF problem. Through both in silico experiments and real data application, we demonstrate the performance of our method, outperforming a competing machine learning approach.
Conclusion
Phyolin is an accurate, easy to use and fast method for classifying an evolutionary trajectory as linear or branched given a tumor’s single-cell DNA sequencing data.
Convex recoloring as an evolutionary marker Frenkel, Zeev; Kiat, Yosef; Izhaki, Ido ...
Molecular phylogenetics and evolution,
February 2017, 2017-02-00, 20170201, Letnik:
107
Journal Article
Recenzirano
Display omitted
•Adequacy measure of a taxonomic classifier to a given tree.•A statistical framework, in terms of a p-value for the score above.•Goodness of fit for supertree techniques.•Applications ...to real data.•Graphical, user-friendly, software implementation.
With the availability of enormous quantities of genetic data it has become common to construct very accurate trees describing the evolutionary history of the species under study, as well as every single gene of these species. These trees allow us to examine the evolutionary compliance of given markers (characters). A marker compliant with the history of the species investigated, has undergone mutations along the species tree branches, such that every subtree of that tree exhibits a different state. Convex recoloring (CR) uses combinatorial representation to measure the adequacy of a taxonomic classifier to a given tree. Despite its biological origins, research on CR has been almost exclusively dedicated to mathematical properties of the problem, or variants of it with little, if any, relationship to taxonomy. In this work we return to the origins of CR. We put CR in a statistical framework and introduce and learn the notion of the statistical significance of a character. We apply this measure to two data sets - Passerine birds and prokaryotes, and four examples. These examples demonstrate various applications of CR, from evolutionary relatedness, through lateral evolution, to supertree construction. The above study was done with a new software that we provide, containing algorithmic improvement with a graphical output of a (optimally) recolored tree.
Availability: A code implementing the features and a README is available at http://research.haifa.ac.il/ssagi/software/convexrecoloring.zip.
We study two problems in computational phylogenetics. The first is tree compatibility. The input is a collection P of phylogenetic trees over different partially-overlapping sets of species. The goal ...is to find a single phylogenetic tree that displays all the evolutionary relationships implied by P . The second problem is incomplete directed perfect phylogeny (IDPP). The input is a data matrix describing a collection of species by a set of characters, where some of the information is missing. The question is whether there exists a way to fill in the missing information so that the resulting matrix can be explained by a phylogenetic tree satisfying certain conditions. We explain the connection between tree compatibility and IDPP and show that a recent tree compatibility algorithm is effectively a generalization of an earlier IDPP algorithm. Both algorithms rely heavily on maintaining the connected components of a graph under a sequence of edge and vertex deletions, for which they use the dynamic connectivity data structure of Holm et al., known as HDT. We present a computational study of algorithms for tree compatibility and IDPP. We show experimentally that substituting HDT by a much simpler data structure—essentially, a single-level version of HDT—improves the performance of both of these algorithm in practice. We give partial empirical and theoretical justifications for this observation.
Hajirasouliha and Raphael (WABI 2014) proposed a model for deconvoluting mixed tumor samples measured from a collection of high-throughput sequencing reads. This is related to understanding tumor ...evolution and critical cancer mutations. In short, their formulation asks to split each row of a binary matrix so that the resulting matrix corresponds to a perfect phylogeny and has the minimum number of rows among all matrices with this property. In this paper, we disprove several claims about this problem, including an NP-hardness proof of it. However, we show that the problem is indeed NP-hard, by providing a different proof. We also prove NP-completeness of a variant of this problem proposed in the same paper. On the positive side, we propose an efficient (though not necessarily optimal) heuristic algorithm based on coloring co-comparability graphs, and a polynomial time algorithm for solving the problem optimally on matrix instances in which no column is contained in both columns of a pair of conflicting columns. Implementations of these algorithms are freely available at https://github.com/alexandrutomescu/MixedPerfectPhylogeny.