An estimated 15% or more of the cancer burden worldwide is attributable to known infectious agents. We screened colorectal carcinoma and matched normal tissue specimens using RNA-seq followed by host ...sequence subtraction and found marked over-representation of Fusobacterium nucleatum sequences in tumors relative to control specimens. F. nucleatum is an invasive anaerobe that has been linked previously to periodontitis and appendicitis, but not to cancer. Fusobacteria are rare constituents of the fecal microbiota, but have been cultured previously from biopsies of inflamed gut mucosa. We obtained a Fusobacterium isolate from a frozen tumor specimen; this showed highest sequence similarity to a known gut mucosa isolate and was confirmed to be invasive. We verified overabundance of Fusobacterium sequences in tumor versus matched normal control tissue by quantitative PCR analysis from a total of 99 subjects (p = 2.5 × 10(-6)), and we observed a positive association with lymph node metastasis.
Abstract
Motivation
Sequencing of human genomes is now routine, and assembly of shotgun reads is increasingly feasible. However, assemblies often fail to inform about chromosome-scale structure due ...to a lack of linkage information over long stretches of DNA-a shortcoming that is being addressed by new sequencing protocols, such as the GemCode and Chromium linked reads from 10 × Genomics.
Results
Here, we present ARCS, an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H.sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts.
Availability and implementation
https://github.com/bcgsc/ARCS/
Supplementary information
Supplementary data are available at Bioinformatics online.
The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including ...analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.
We are in the midst of a global viral pandemic, one with no clear cure and a high mortality rate. The Human Leukocyte Antigen (HLA) gene complex plays a critical role in host immunity. Here we ...analyse data from eight patient samples from the initial COVID-19 outbreak, and report on their HLA profiles. Our analysis paradigm may help provide future insights on disease susceptibility.
We predicted HLA class I and II alleles from the transcriptome sequencing data prepared from the bronchoalveolar lavage (BAL) fluid samples of five patients and from the peripheral blood mononuclear cells (PBMC) of three patients at the early stage of the COVID-19 outbreak in Wuhan, China. We observe certain HLA allele overlap between the patients of those small cohorts, including the HLA-I A*24:02 allele predicted in four out of the five patients of the first cohort and HLA-II haplotype DPA1*02:02-DPB1*05:01 predicted in seven out of the eight patient samples analysed. Our analysis demonstrates the technical feasibility of HLA typing from BAL and PMBC metatranscriptomic samples. We stress that these are small cohorts, and although informative with respect to uncovering possible host defence genes shared between patients, additional work on larger cohorts is warranted to yield new and actionable knowledge on host susceptibility to SARS-CoV-2.
The tool used to perform the reported analysis results, HLAminer, is available from https://github.com/bcgsc/hlaminer. Predictions are available for download from https://www.bcgsc.ca/downloads/btl/SARS-CoV-2/BAL.
Supplementary data are available at Bioinformatics online.
Abstract
Background
Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly ...benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads.
Results
LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of
Caenorhabditis elegans
,
Oryza sativa
, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM.
Conclusions
Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at
https://github.com/bcgsc/longstitch
.
Antibiotic resistance is a growing global health concern prompting researchers to seek alternatives to conventional antibiotics. Antimicrobial peptides (AMPs) are attracting attention again as ...therapeutic agents with promising utility in this domain, and using in silico methods to discover novel AMPs is a strategy that is gaining interest. Such methods can sift through large volumes of candidate sequences and reduce lab screening costs.
Here we introduce AMPlify, an attentive deep learning model for AMP prediction, and demonstrate its utility in prioritizing peptide sequences derived from the Rana Lithobates catesbeiana (bullfrog) genome. We tested the bioactivity of our predicted peptides against a panel of bacterial species, including representatives from the World Health Organization's priority pathogens list. Four of our novel AMPs were active against multiple species of bacteria, including a multi-drug resistant isolate of carbapenemase-producing Escherichia coli.
We demonstrate the utility of deep learning based tools like AMPlify in our fight against antibiotic resistance. We expect such tools to play a significant role in discovering novel candidates of peptide-based alternatives to classical antibiotics.
Abstract
Motivation
In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although ...utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes.
Results
We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E.coli and C.elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30–40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024.
Availability and implementation
https://github.com/bcgsc/ntedit
Supplementary information
Supplementary data are available at Bioinformatics online.
Massively parallel sequencing is a useful approach for characterizing T-cell receptor diversity. However, immune receptors are extraordinarily difficult sequencing targets because any given receptor ...variant may be present in very low abundance and may differ legitimately by only a single nucleotide. We show that the sensitivity of sequence-based repertoire profiling is limited by both sequencing depth and sequencing accuracy. At two timepoints, 1 wk apart, we isolated bulk PBMC plus naïve (CD45RA+/CD45RO-) and memory (CD45RA-/CD45RO+) T-cell subsets from a healthy donor. From T-cell receptor beta chain (TCRB) mRNA we constructed and sequenced multiple libraries to obtain a total of 1.7 billion paired sequence reads. The sequencing error rate was determined empirically and used to inform a high stringency data filtering procedure. The error filtered data yielded 1,061,522 distinct TCRB nucleotide sequences from this subject which establishes a new, directly measured, lower limit on individual T-cell repertoire size and provides a useful reference set of sequences for repertoire analysis. TCRB nucleotide sequences obtained from two additional donors were compared to those from the first donor and revealed limited sharing (up to 1.1%) of nucleotide sequences among donors, but substantially higher sharing (up to 14.2%) of inferred amino acid sequences. For each donor, shared amino acid sequences were encoded by a much larger diversity of nucleotide sequences than were unshared amino acid sequences. We also observed a highly statistically significant association between numbers of shared sequences and shared HLA class I alleles.
The Human Leukocyte Antigen (HLA) gene locus plays a fundamental role in human immunity, and it is established that certain HLA alleles are disease determinants. Previously, we have identified ...prevalent HLA class I and class II alleles, including DPA1*02:02, in two small patient cohorts at the COVID-19 pandemic onset.
We have since analyzed a larger public patient cohort data (
= 126 patients) with controls, associated demographic and clinical data. By combining the predictive power of multiple
HLA predictors, we report on HLA-I and HLA-II alleles, along with their associated risk significance.
We observe HLA-II DPA1*02:02 at a higher frequency in the COVID-19 positive cohort (29%) when compared to the COVID-negative control group (Fisher's exact test FET
= 0.0174). Having this allele, however, does not appear to put this cohort's patients at an increased risk of hospitalization. Inspection of COVID-19 disease severity outcomes, including admission to intensive care, reveal nominally significant risk associations with A*11:01 (FET
= 0.0078) and C*04:01 (FET
= 0.0087). The association with severe disease outcome is especially evident for patients with C*04:01, where disease prognosis measured by mechanical ventilation-free days was statistically significant after multiple hypothesis correction (Bonferroni
= 0.0323). While prevalence of some of these alleles falls below statistical significance after Bonferroni correction, COVID-19 patients with HLA-I C*04:01 tend to fare worse overall. This HLA allele may hold potential clinical value.
Recent advances in next generation sequencing have made it possible to precisely characterize all somatic coding mutations that occur during the development and progression of individual cancers. ...Here we used these approaches to sequence the genomes (>43-fold coverage) and transcriptomes of an oestrogen-receptor- -positive metastatic lobular breast cancer at depth. We found 32 somatic non-synonymous coding mutations present in the metastasis, and measured the frequency of these somatic mutations in DNA from the primary tumour of the same patient, which arose 9 years earlier. Five of the 32 mutations (in ABCB11, HAUS3, SLC24A4, SNX4 and PALB2) were prevalent in the DNA of the primary tumour removed at diagnosis 9 years earlier, six (in KIF1C, USP28, MYH8, MORC1, KIAA1468 and RNASEH2A) were present at lower frequencies (1-13%), 19 were not detected in the primary tumour, and two were undetermined. The combined analysis of genome and transcriptome data revealed two new RNA-editing events that recode the amino acid sequence of SRP9 and COG3. Taken together, our data show that single nucleotide mutational heterogeneity can be a property of low or intermediate grade primary breast cancers and that significant evolution can occur with disease progression.