Several new genomics technologies have become available that offer long-read sequencing or long-range mapping with higher throughput and higher-resolution analysis than ever before. These long-range technologies are rapidly advancing the field with improved reference genomes, more comprehensive variant identification and more complete views of transcriptomes and epigenomes. However, they also require new bioinformatics approaches to take full advantage of their unique characteristics while overcoming their complex errors and modalities. Here, we discuss several of the most important applications of the new technologies, focusing on both the currently available bioinformatics tools and opportunities for future research.
Abstract
Motivation
Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score.
Results
Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these ‘gold standard’ Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM and vg to align more reads correctly.
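To make the idea of a heuristic-free, semiglobal alignment with affine gap penalties concrete, the following is a minimal sketch in Python. It is not the Vargas implementation: it handles only a linear reference, runs scalar code rather than SIMD, and ignores base qualities. The function name and the scoring values (match bonus, mismatch penalty, gap open and gap extend) are illustrative assumptions, not the defaults of any particular aligner. Each inner-loop iteration corresponds to one dynamic-programming cell update.

```python
NEG = float("-inf")

def semiglobal_affine_score(read, ref, match=2, mismatch=-3,
                            gap_open=5, gap_extend=3):
    """Best score for aligning ALL of `read` to any substring of `ref`.

    Exhaustive dynamic programming (Gotoh-style), so no heuristic can miss
    the optimum. A gap of length k costs gap_open + k * gap_extend.
    Scoring values are hypothetical placeholders.
    """
    n = len(ref)
    # Row 0: the alignment may start at any reference position at no cost.
    h_prev = [0] * (n + 1)      # best score ending at cell (i-1, j)
    d_prev = [NEG] * (n + 1)    # best score ending in a read base opposite a gap
    for i in range(1, len(read) + 1):
        h_cur = [NEG] * (n + 1)
        d_cur = [NEG] * (n + 1)
        # Column 0: read bases placed before the reference starts.
        d_cur[0] = max(h_prev[0] - gap_open - gap_extend,
                       d_prev[0] - gap_extend)
        h_cur[0] = d_cur[0]
        skip = NEG  # best score ending in a reference base opposite a gap
        for j in range(1, n + 1):
            d_cur[j] = max(h_prev[j] - gap_open - gap_extend,
                           d_prev[j] - gap_extend)
            skip = max(h_cur[j - 1] - gap_open - gap_extend,
                       skip - gap_extend)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            h_cur[j] = max(h_prev[j - 1] + sub, d_cur[j], skip)
        h_prev, d_prev = h_cur, d_cur
    # The alignment may end at any reference position at no cost.
    return max(h_prev)
```

Because every cell of the read-by-reference matrix is filled, the returned score is optimal under the chosen scoring function, which is what allows such scores to serve as a reference point when tuning a heuristic aligner.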
Availability and implementation
Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license.
Supplementary information
Supplementary data are available at Bioinformatics online.
Orthology analysis is a fundamental tool in comparative genomics. Sophisticated methods have been developed to distinguish between orthologs and paralogs and to classify paralogs into subtypes depending on the duplication mechanism and timing, relative to speciation. However, no comparable framework exists for xenologs: gene pairs whose history, since their divergence, includes a horizontal transfer. Further, the diversity of gene pairs that meet this broad definition calls for classification of xenologs with similar properties into subtypes.
We present a xenolog classification that uses phylogenetic reconciliation to assign each pair of genes to a class based on the event responsible for their divergence and the historical association between genes and species. Our classes distinguish between genes related through transfer alone and genes related through duplication and transfer. Further, they separate closely-related genes in distantly-related species from distantly-related genes in closely-related species. We present formal rules that assign gene pairs to specific xenolog classes, given a reconciled gene tree with an arbitrary number of duplications and transfers. These xenology classification rules have been implemented in software and tested on a collection of ∼13 000 prokaryotic gene families. In addition, we present a case study demonstrating the connection between xenolog classification and gene function prediction.
The xenolog classification rules have been implemented in Notung 2.9, a freely available phylogenetic reconciliation software package, available at http://www.cs.cmu.edu/~durand/Notung. Gene trees are available at http://dx.doi.org/10.7488/ds/1503.
durand@cmu.edu.
Supplementary data are available at Bioinformatics online.
Abstract
Summary
Bulk RNA sequencing studies have demonstrated that human leukocyte antigen (HLA) genes may be expressed in a cell type-specific and allele-specific fashion. Single-cell gene expression assays have the potential to further resolve these expression patterns, but currently available methods do not perform allele-specific quantification at the molecule level. Here, we present scHLAcount, a post-processing workflow for single-cell RNA-seq data that computes allele-specific molecule counts of the HLA genes based on a personalized reference constructed from the sample’s HLA genotypes.
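As an illustration of what molecule-level, allele-specific counting means, the sketch below collapses reads that share a cell barcode and a unique molecular identifier (UMI) into one molecule, then counts molecules per cell, gene and allele. This is only a sketch under stated assumptions, not the scHLAcount implementation: the input record layout, the use of UMIs, and the function name are hypothetical, and the step that assigns each read to an allele of the personalized reference is assumed to have already happened.

```python
from collections import defaultdict

def allele_molecule_counts(assignments):
    """Count distinct molecules per (cell barcode, gene, allele).

    `assignments` is an iterable of (cell_barcode, umi, gene, allele)
    tuples, one per read; this layout is a hypothetical input format.
    Reads sharing a cell barcode and UMI are treated as one molecule.
    """
    umis = defaultdict(set)
    for cell, umi, gene, allele in assignments:
        umis[(cell, gene, allele)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}

# Usage with made-up records: two reads from the same molecule, plus one
# read each from two other molecules.
reads = [
    ("CELL1", "UMI_A", "HLA-A", "A*02:01"),
    ("CELL1", "UMI_A", "HLA-A", "A*02:01"),
    ("CELL1", "UMI_B", "HLA-A", "A*02:01"),
    ("CELL1", "UMI_C", "HLA-A", "A*01:01"),
]
counts = allele_molecule_counts(reads)
```

Counting distinct UMIs rather than reads keeps amplification duplicates from inflating the apparent expression of one allele.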
Availability and implementation
scHLAcount is available under the MIT license at https://github.com/10XGenomics/scHLAcount.
Supplementary information
Supplementary data are available at Bioinformatics online.
Linked-read sequencing, using highly-multiplexed genome partitioning and barcoding, can span hundreds of kilobases to improve de novo assembly, haplotype phasing, and other applications. Based on our analysis of 14 datasets, we introduce LRSim, which simulates linked reads by emulating the library preparation and sequencing process with fine control over variants, linked-read characteristics, and the short-read profile. From the phasing and assembly of multiple datasets, we derive recommendations on coverage, fragment length, and partitioning when sequencing genomes of different sizes and complexities. These optimizations improve results by orders of magnitude and enable the development of novel methods. LRSim is available at https://github.com/aquaskyline/LRSIM.
Linked-read sequencing greatly improves haplotype assembly over standard paired-end analysis. The detection of mosaic single-nucleotide variants benefits from haplotype assembly when the model is informed by the mapping between constituent reads and linked reads. Samovar evaluates haplotype-discordant reads identified through linked-read sequencing, thus enabling phasing and mosaic variant detection across the entire genome. Samovar trains a random forest model to score candidate sites using a dataset that considers read quality, phasing, and linked-read characteristics. Samovar calls mosaic single-nucleotide variants (SNVs) within a single sample with accuracy comparable with what previously required trios or matched tumor/normal pairs and outperforms single-sample mosaic variant callers at minor allele frequency 5%–50% with at least 30X coverage. Samovar finds somatic variants in both tumor and normal whole-genome sequencing from 13 pediatric cancer cases that can be corroborated with high recall with whole exome sequencing. Samovar is available open-source at https://github.com/cdarby/samovar under the MIT license.
•Samovar uses haplotype-specific features from linked reads to call mosaic variants
•Samovar quickly evaluates candidates with a random forest over 33 features
•Only one sample is needed, with accuracy comparable with paired samples or trios
•Samovar finds somatic variants in cancer driving genes in 13 pediatric cancer cases
Biological Sciences; Genomics; Bioinformatics
The simultaneous measurement of multiple modalities represents an exciting frontier for single-cell genomics and necessitates computational methods that can define cellular states based on multimodal data. Here, we introduce “weighted-nearest neighbor” analysis, an unsupervised framework to learn the relative utility of each data type in each cell, enabling an integrative analysis of multiple modalities. We apply our procedure to a CITE-seq dataset of 211,000 human peripheral blood mononuclear cells (PBMCs) with panels extending to 228 antibodies to construct a multimodal reference atlas of the circulating immune system. Multimodal analysis substantially improves our ability to resolve cell states, allowing us to identify and validate previously unreported lymphoid subpopulations. Moreover, we demonstrate how to leverage this reference to rapidly map new datasets and to interpret immune responses to vaccination and coronavirus disease 2019 (COVID-19). Our approach represents a broadly applicable strategy to analyze single-cell multimodal datasets and to look beyond the transcriptome toward a unified and multimodal definition of cellular identity.
•“Weighted nearest neighbor” analysis integrates multimodal single-cell data
•A multimodal reference “atlas” of the circulating human immune system
•Identification and validation of novel sources of lymphoid heterogeneity
•“Reference-based” mapping of query datasets onto a multimodal atlas
A framework that allows for the integration of multiple data types using single cells is applied to understand distinct immune cell states, previously unidentified immune populations, and to interpret immune responses to vaccinations.
Adjuvant proton beam therapy (PBT) is increasingly available to patients with breast cancer. It achieves better planned dose distributions than standard photon radiation therapy and therefore may reduce the risks. However, clinical evidence is lacking.
A systematic review of clinical outcomes from studies of adjuvant PBT for early breast cancer published from 2000 to 2022 was undertaken. Early breast cancer was defined as when all detected invasive cancer cells are in the breast or nearby lymph nodes and can be removed surgically. Adverse outcomes were summarized quantitatively, and the prevalence of the most common ones was estimated using meta-analysis.
Thirty-two studies (1452 patients) reported clinical outcomes after adjuvant PBT for early breast cancer. Median follow-up ranged from 2 to 59 months. There were no published randomized trials comparing PBT with photon radiation therapy. Scattering PBT was delivered in 7 studies (258 patients) starting 2003 to 2015 and scanning PBT in 22 studies (1041 patients) starting 2000 to 2019. Two studies (123 patients) starting 2011 used both PBT types. For 1 study (30 patients), PBT type was unspecified. Adverse events were less severe after scanning than after scattering PBT. They also varied by clinical target. For partial breast PBT, 498 adverse events were reported (8 studies, 358 patients). None were categorized as severe after scanning PBT. For whole breast or chest wall ± regional lymph nodes PBT, 1344 adverse events were reported (19 studies, 933 patients). After scanning PBT, 4% (44/1026) of events were severe. The most prevalent severe outcome after scanning PBT was dermatitis, which occurred in 5.7% (95% confidence interval, 4.2-7.6) of patients. Other severe adverse outcomes included infection, pain, and pneumonitis (each ≤1%). Of the 141 reconstruction events reported (13 studies, 459 patients), the most prevalent after scanning PBT was prosthetic implant removal (34/181, 19%).
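The percentages quoted above follow directly from the reported counts, and a short sketch can make the arithmetic explicit. The Wilson score interval shown here is an assumption chosen for illustration: it is a standard single-sample interval for a proportion, not the meta-analytic pooling used in the review, so it should not be expected to reproduce the published confidence interval. The function names are hypothetical.

```python
import math

def percent(events, total):
    """Proportion expressed as a percentage."""
    return 100.0 * events / total

def wilson_interval(events, total, z=1.96):
    """Wilson score interval for a single proportion (illustrative only;
    not the pooled meta-analysis estimate used in the review)."""
    p = events / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total
                         + z * z / (4.0 * total * total)) / denom
    return center - half, center + half

# Counts taken from the text above.
severe_pct = percent(44, 1026)     # severe events after scanning PBT
removal_pct = percent(34, 181)     # prosthetic implant removal
low, high = wilson_interval(34, 181)
```

Rounded to whole numbers, the two proportions give the 4% and 19% stated in the text.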
This is a quantitative summary of all published clinical outcomes after adjuvant PBT for early breast cancer. Ongoing randomized trials will provide information on its longer-term safety compared with standard photon radiation therapy.