The recent breakthroughs in assembling long error-prone reads were based on the overlap-layout-consensus (OLC) approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the OLC approach is the only practical paradigm for assembling long error-prone reads. We show how to generalize de Bruijn graphs for assembling long error-prone reads and describe the ABruijn assembler, which combines the de Bruijn graph and the OLC approaches and results in accurate genome reconstructions.
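For readers unfamiliar with the data structure named above, the following is a minimal sketch of the classic de Bruijn graph built from exact k-mers. It is not ABruijn's generalized graph or its algorithm; the function name `de_bruijn_graph`, the toy reads, and the choice of `k` are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a classic de Bruijn graph from error-free reads.

    Nodes are (k-1)-mers. Every k-mer occurrence in a read adds one
    edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy reads, chosen only for illustration.
graph = de_bruijn_graph(["ACGTAC", "GTACG"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Because every k-mer must match exactly, a single sequencing error creates spurious nodes and edges, which is why this plain construction is usually associated with short, accurate reads.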
We tested the hypothesis that Crohn's disease (CD)-related genetic polymorphisms involved in host innate immunity are associated with shifts in human ileum-associated microbial composition in a cross-sectional analysis of human ileal samples. Sanger sequencing of the bacterial 16S ribosomal RNA (rRNA) gene and 454 sequencing of 16S rRNA gene hypervariable regions (V1-V3 and V3-V5) were conducted on macroscopically disease-unaffected ileal biopsies collected from 52 ileal CD, 58 ulcerative colitis and 60 control patients without inflammatory bowel diseases (IBD) undergoing initial surgical resection. These subjects were also genotyped for the three major NOD2 risk alleles (Leu1007fs, R708W, G908R) and the ATG16L1 risk allele (T300A). The samples were linked to clinical metadata, including body mass index, smoking status and Clostridium difficile infection. The sequences were classified into seven phyla/subphyla categories using the Naïve Bayesian Classifier of the Ribosome Database Project. Centered log ratio transformation of six predominant categories was included as the dependent variable in the permutation-based MANCOVA for the overall composition with stepwise variable selection. Polymerase chain reaction (PCR) assays were conducted to measure the relative frequencies of the Clostridium coccoides - Eubacterium rectale group and the Faecalibacterium prausnitzii spp. Empiric logit transformations of the relative frequencies of these two microbial groups were included in permutation-based ANCOVA. Regardless of sequencing method, IBD phenotype, Clostridium difficile infection and NOD2 genotype were selected as associated (FDR ≤ 0.05) with shifts in overall microbial composition. IBD phenotype and NOD2 genotype were also selected as associated with shifts in the relative frequency of the C. coccoides - E. rectale group. IBD phenotype, smoking and IBD medications were selected as associated with shifts in the relative frequency of F. prausnitzii spp.
These results indicate that the effects of genetic and environmental factors on IBD are mediated at least in part by the enteric microbiota.
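The centered log ratio (CLR) transformation mentioned in this abstract maps each component of a composition to its log minus the mean of all logs. Below is a minimal sketch. The function name `clr`, the pseudocount used to avoid taking the log of zero, and the six example proportions are assumptions for illustration and are not taken from the study.

```python
import math


def clr(proportions, pseudocount=1e-6):
    """Centered log ratio: log of each part minus the mean log of all parts.

    The pseudocount is an assumed convention for handling zero
    proportions; the study's own zero handling is not stated.
    """
    logs = [math.log(p + pseudocount) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical relative abundances of six taxonomic categories.
sample = [0.50, 0.30, 0.10, 0.05, 0.03, 0.02]
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

The transformed values always sum to zero, which removes the unit-sum constraint of raw proportions before they enter a linear model such as MANCOVA.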
Long-read sequencing technologies have substantially improved the assemblies of many isolate bacterial genomes as compared to fragmented short-read assemblies. However, assembling complex metagenomic datasets remains difficult even for state-of-the-art long-read assemblers. Here we present metaFlye, which addresses important long-read metagenomic assembly challenges, such as uneven bacterial composition and intra-species heterogeneity. First, we benchmarked metaFlye using simulated and mock bacterial communities and show that it consistently produces assemblies with better completeness and contiguity than state-of-the-art long-read assemblers. Second, we performed long-read sequencing of the sheep microbiome and applied metaFlye to reconstruct 63 complete or nearly complete bacterial genomes within single contigs. Finally, we show that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products.
Researchers in evolutionary genetics recently have recognized an exciting opportunity in decomposing beneficial mutations into their proximal, mechanistic determinants. The application of methods and concepts from molecular biology and life history theory to studies of lytic bacteriophages (phages) has allowed them to understand how natural selection sees mutations influencing life history. This work motivated the research presented here, in which we explored whether, under consistent experimental conditions, small differences in the genome of bacteriophage φX174 could lead to altered life history phenotypes among a panel of eight genetically distinct clones. We assessed the clones' phenotypes by applying a novel statistical framework to the results of a serially sampled parallel infection assay, in which we simultaneously inoculated each of a large number of replicate host volumes with ∼1 phage particle. We sequentially plated the volumes over the course of infection and counted the plaques that formed after incubation. These counts served as a proxy for the number of phage particles in a single volume as a function of time. From repeated assays, we inferred significant, genetically determined heterogeneity in lysis time and burst size, including lysis time variance. These findings are interesting in light of the genetic and phenotypic constraints on the single-protein lysis mechanism of φX174. We speculate briefly on the mechanisms underlying our results, and we discuss the potential importance of lysis time variance in viral evolution.
Histamine has been shown to play a role in arthropod vision; it is the major neurotransmitter of arthropod photoreceptors. Histamine-gated chloride channels have been identified in insect optic lobes. We report the first isolation of cDNA clones encoding histamine-gated chloride channel subunits from the fruit fly Drosophila melanogaster. The encoded proteins, HisCl1 and HisCl2, share 60% amino acid identity with each other. The closest structural homologue is the human glycine α3 receptor, which shares 45 and 43% amino acid identity respectively. Northern hybridization analysis suggested that hisCl1 and hisCl2 mRNAs are predominantly expressed in the insect eye. Oocytes injected with in vitro transcribed RNA, encoding either HisCl1 or HisCl2, produced substantial chloride currents in response to histamine but not in response to GABA, glycine, and glutamate. The histamine sensitivity was similar to that observed in insect laminar neurons. Histamine-activated currents were not blocked by picrotoxinin, fipronil, strychnine, or the H2 antagonist cimetidine. Co-injection of both hisCl1 and hisCl2 RNAs resulted in expression of a histamine-gated chloride channel with increased sensitivity to histamine, demonstrating coassembly of the subunits. The insecticide ivermectin reversibly activated homomeric HisCl1 channels and, more potently, HisCl1 and HisCl2 heteromeric channels.
We introduce a rapid interferer detector that uses compressed sampling (CS) with a quadrature analog-to-information converter (QAIC). By exploiting bandpass CS, a blind sub-Nyquist sampling approach, the QAIC offers energy-efficient and rapid interferer detection over a wide instantaneous bandwidth. The QAIC front end is implemented in 65 nm CMOS in 0.43 mm² and consumes 81 mW from a 1.1 V supply. It senses a frequency span of 1 GHz ranging from 2.7 to 3.7 GHz (PCAST Band) with a resolution bandwidth of 20 MHz in 4.4 μs, 50 times faster than traditional sweeping spectrum scanners. The rapid interferer detector with the bandpass QAIC is two orders of magnitude more energy efficient than traditional Nyquist-rate architectures and one order of magnitude more energy efficient than existing low-pass CS methods. Thanks to CS, the aggregate sampling rate of the QAIC interferer detector is compressed by 6.3× compared to traditional Nyquist-rate architectures for the same instantaneous bandwidth.
The Emergency Severity Index (ESI), the most commonly used triage system, is used in over 70% of all U.S. emergency departments (EDs) and triages patients by predicted resource utilization [1]. Mistriage, which includes both undertriage and overtriage, has been a persistent issue, affecting 32.2% of total ED visits [2]. Our goal is to develop a machine learning framework that predicts patients' resource needs, thereby improving resource allocation during triage.
This retrospective study analyzed ED visits from the Medical Information Mart for Intensive Care IV, dividing the data into training (80%) and testing (20%) cohorts. We utilized data available during triage, including patient vital signs, age, gender, mode of arrival, medication history, and chief complaint. Azure AutoML was used to create different machine learning models trained to predict the 144 target columns including laboratory panels and imaging modalities as well as medications required during patients' ED visits. The 144 models' performance was evaluated using the area under the receiver operating characteristic curve (AUROC), F1 score, accuracy, precision and recall.
A total of 391,472 ED visits were analyzed. A voting ensemble model was created for each of the 144 targets. On average, the models achieved an AUROC of 0.82 and an accuracy of 0.76. We gathered the feature importance for each target and observed that ‘chief complaint’, among others, had a high aggregate feature importance across targets.
This study shows that a machine learning model can predict the resource needs of ED patients with high accuracy. This can greatly improve patient flow and resource allocation in already resource-limited emergency departments.
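The evaluation metrics named in the methods of this abstract (AUROC, F1 score, accuracy, precision and recall) can be sketched for a single binary target as follows. This is a plain-Python illustration, not the Azure AutoML pipeline used in the study; the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Requires at least one positive
    and one negative label.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 means the resource (e.g. a lab panel) was needed.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
area = auroc(y_true, scores)
print(metrics, round(area, 3))
```

AUROC depends only on how the scores rank positives against negatives, whereas accuracy, precision, recall and F1 all depend on the chosen threshold, so the two kinds of metric can disagree for the same model.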
Genome assembly is the problem of reconstructing genomes from DNA sequence reads. Even the best assemblies are often fragmented due to the presence of repetitive regions in the genome. Using long, single molecule sequencing (SMS) reads can improve the contiguity of these assemblies, but such reads still fail to resolve long repetitive regions. Furthermore, the high error rate of SMS reads poses additional difficulties for assembly, raising the question of whether the popular de Bruijn graph (DBG) approach to genome assembly can be applied to SMS reads. First, I present ABruijn, the first genome assembler for SMS reads that follows the DBG approach. By modifying the DBG into an A-Bruijn graph, ABruijn is able to produce very polished assemblies for simple genomes such as E. coli and S. cerevisiae. However, ABruijn has some difficulties with processing very repetitive regions and very large genomes. To address ABruijn’s shortcomings, I helped to develop Flye, a DBG-based assembler for SMS reads that can be applied to large mammalian genomes such as the human genome. Flye features a much more efficient method for resolving highly repetitive regions and also generates a repeat graph, which offers a compact representation of all of the repeats in a genome. Flye further performs steps to resolve those repeats and improve the quality of the assembly, resulting in a more contiguous assembly of the human genome compared to other state-of-the-art assemblers. Finally, I present diploidFlye, a haplotype-aware extension of Flye that is able to phase the contigs for assemblies of diploid organisms. diploidFlye takes advantage of the repeat graph generated by Flye to efficiently identify heterozygous variants and generate haplocontigs (haplotype-specific contigs) from the reads.
Overall, this dissertation presents several novel algorithms for improving the performance of the de novo genome assembly of long SMS reads, establishing the efficacy of the DBG approach even for error-prone SMS reads and developing a state-of-the-art assembler known as Flye with many novel features for improving the overall assembly.
Genetic studies of autism spectrum disorder (ASD) have established that de novo duplications and deletions contribute to risk. However, ascertainment of structural variants (SVs) has been restricted by the coarse resolution of current approaches. By applying a custom pipeline for SV discovery, genotyping, and de novo assembly to genome sequencing of 235 subjects (71 affected individuals, 26 healthy siblings, and their parents), we compiled an atlas of 29,719 SV loci (5,213/genome), comprising 11 different classes. We found a high diversity of de novo mutations, the majority of which were undetectable by previous methods. In addition, we observed complex mutation clusters where combinations of de novo SVs, nucleotide substitutions, and indels occurred as a single event. We estimate a high rate of structural mutation in humans (20%) and propose that genetic risk for ASD is attributable to an elevated frequency of gene-disrupting de novo SVs, but not an elevated rate of genome rearrangement.