Summary
We present an R package, ggtree, which provides programmable visualization and annotation of phylogenetic trees.
ggtree can read more tree file formats than other software, including Newick, NEXUS, NHX, PHYLIP and jplace formats, and supports visualization of phylo, multiPhylo, phylo4, phylo4d, obkData and phyloseq tree objects defined in other R packages. It can also extract the tree/branch/node‐specific and other data from the analysis outputs of BEAST, EPA, HyPhy, PAML, PHYLDOG, pplacer, r8s, RAxML and RevBayes software, and allows using these data to annotate the tree.
The package allows colouring and annotation of a tree by numerical/categorical node attributes, manipulating a tree by rotating, collapsing and zooming out clades, highlighting user‐selected clades or operational taxonomic units, and exploring a large tree by zooming into a selected portion.
A two‐dimensional tree can be drawn by scaling the tree width based on an attribute of the nodes. A tree can be annotated with an associated numerical matrix (as a heat map), multiple sequence alignment, subplots or silhouette images.
The package ggtree is released under the Artistic‐2.0 License. The source code and documents are freely available through Bioconductor (http://www.bioconductor.org/packages/ggtree).
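To make the Newick format mentioned in the summary concrete, here is a minimal Python sketch of a recursive-descent parser for a simplified subset of Newick. It is an illustration only and is not how ggtree reads trees; the function names `parse_newick` and `tip_labels` are hypothetical, and quoted labels, comments in square brackets and NHX annotations are deliberately not handled.

```python
def parse_newick(text):
    """Parse a simplified Newick string into nested dicts.

    Each node is {'name': str, 'length': float or None, 'children': list}.
    Assumption: no quoted labels, no [comments], no NHX tags.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0  # cursor into text, shared with the nested parser

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # end of this clade
                    break
                else:
                    raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
        # Optional node label (may be empty for unnamed internal nodes).
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ':'.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    return root


def tip_labels(node):
    """Return the labels of the tips (leaves) in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [label for child in node["children"] for label in tip_labels(child)]


tree = parse_newick("((A:0.1,B:0.2)AB:0.3,C:0.4);")
print(tip_labels(tree))
```

The nested-dict representation keeps node labels and branch lengths together, which are the kinds of node and branch attributes the summary describes using for annotation.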
Abstract
This article describes several features in the MAFFT online service for multiple sequence alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological sequences are available and the need for MSAs with large numbers of sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools for experimental biologists to semiautomatically handle large data are becoming important. We are developing MAFFT in these two directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine sequence data sets and MSAs.
In contrast to artificial intelligence and machine learning approaches, KEGG (https://www.kegg.jp) has relied on human intelligence to develop “models” of biological systems, especially in the form of KEGG pathway maps that are manually created by capturing knowledge from published literature. The KEGG models can then be used in biological big data analysis, for example, for uncovering systemic functions of an organism hidden in its genome sequence through the simple procedure of KEGG mapping. Here we present an updated version of KEGG Mapper, a suite of KEGG mapping tools reported previously (Kanehisa and Sato, Protein Sci 2020; 29:28–35), together with the new versions of the KEGG pathway map viewer and the BRITE hierarchy viewer. Significant enhancements have been made for BRITE mapping, where the mapping result can be examined by manipulation of hierarchical trees, such as pruning and zooming. The tree manipulation feature has also been implemented in the taxonomy mapping tool for linking KO (KEGG Orthology) groups and modules to phenotypes.
Human genetic history in East Asia is poorly understood. To clarify population relationships, we obtained genome-wide data from 26 ancient individuals from northern and southern East Asia spanning 9,500–300 years ago. Genetic differentiation was higher in the past than the present, reflecting a major episode of admixture involving northern East Asian ancestry spreading across southern East Asia after the Neolithic, transforming the genetic ancestry of southern China. Mainland southern East Asian and Taiwan Strait island samples from the Neolithic show clear connections with modern and ancient samples with Austronesian-related ancestry, supporting a southern China origin for proto-Austronesians. Connections among Neolithic coastal groups from Siberia and Japan to Vietnam indicate that migration and gene flow played an important role in the prehistory of coastal Asia.
The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina-Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays.
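For readers unfamiliar with the Ferragina-Manzini (FM) index named in the abstract above, the following Python sketch shows exact-match backward search on a plain, linear FM index. It is a teaching example under stated assumptions and not HISAT2's graph FM index: it indexes a single string with no variants, builds the suffix array naively, and stores uncompressed occurrence tables. The names `build_fm_index` and `find` are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM index (suffix array, C table, occurrence table).

    Assumption: '$' does not occur in text and sorts before every other
    symbol, which holds for uppercase DNA letters in ASCII.
    """
    text += "$"  # sentinel marking the end of the text
    n = len(text)
    # Naive suffix array: fine for a demo, far too slow for a genome.
    sa = sorted(range(n), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the symbol preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of symbols in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def find(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(pattern):  # extend the match one symbol to the left
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(find("TTA", index))
```

Each step of the loop narrows the range of suffix-array rows whose suffixes start with the current pattern suffix, so the number of steps depends on the pattern length and not on the text length.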
High-throughput amplicon sequencing of large genomic regions remains challenging for short-read technologies. Here, we report a high-throughput amplicon sequencing approach combining unique molecular identifiers (UMIs) with Oxford Nanopore Technologies (ONT) or Pacific Biosciences circular consensus sequencing, yielding high-accuracy single-molecule consensus sequences of large genomic regions. We applied our approach to sequence ribosomal RNA operon amplicons (~4,500 bp) and genomic sequences (>10,000 bp) of reference microbial communities in which we observed a chimera rate <0.02%. To reach a mean UMI consensus error rate <0.01%, a UMI read coverage of 15× (ONT R10.3), 25× (ONT R9.4.1) and 3× (Pacific Biosciences circular consensus sequencing) is needed, which provides a mean error rate of 0.0042%, 0.0041% and 0.0007%, respectively.
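The idea of building a consensus from reads that share a UMI can be sketched in a few lines of Python. This is a heavily simplified illustration and not the authors' pipeline: it assumes the reads in each UMI bin are already aligned and of equal length, takes a per-column majority vote, and skips bins below a coverage threshold. The function name `umi_consensus` and the `min_coverage` parameter are hypothetical, and the default value is illustrative only.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_coverage=3):
    """Group (umi, sequence) pairs by UMI and majority-vote each column.

    Assumptions: sequences within a bin are pre-aligned and equal in
    length; real workflows align and polish reads instead.
    """
    bins = defaultdict(list)
    for umi, seq in reads:
        bins[umi].append(seq)

    consensus = {}
    for umi, seqs in bins.items():
        if len(seqs) < min_coverage:
            continue  # too few reads to trust a consensus for this molecule
        if len({len(s) for s in seqs}) != 1:
            raise ValueError(f"reads in bin {umi!r} differ in length")
        # zip(*seqs) yields one tuple of bases per alignment column.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    ("U1", "ACGT"),
    ("U1", "ACGA"),  # one read with an error in the last base
    ("U1", "ACGT"),
    ("U2", "TTTT"),  # single read, below the coverage threshold
]
print(umi_consensus(reads))
```

Because random errors rarely hit the same position in most reads of a bin, the vote removes them, which is why a higher per-read error rate calls for a higher UMI read coverage.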
A draft human pangenome reference. Liao, Wen-Wei; Asri, Mobin; Ebler, Jana ... Nature (London), 05/2023, Volume 617, Issue 7960. Journal article, peer reviewed, open access.
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.