Expression quantitative trait locus (eQTL) analysis has proven to be a powerful method for describing how variation in phenotypes may be attributed to a given genotype. While the fields of bioinformatics and genomics have experienced exponential growth with modern technological advances, an unintended consequence is the lack of a gold standard for many applications and methods, a problem that may be compounded by ever-improving computational capabilities. Researchers working on eQTL analysis have at their disposal a multitude of bioinformatics software, each with different assumptions and algorithms, which may produce confusion as to their respective applicability. In this chapter, we introduce eQTLs, survey commonly used software for conducting a mapping study, and provide data correction methods to avoid the pitfalls of such analyses.
• Results support that Xyloplax sp. is a velatid asteroid, rather than a new class.
• Asteroid morphology remained labile well after the body plan of the group was established.
• The orders Forcipulatida, Velatida, Paxillosida, and Spinulosida were each monophyletic.
• Valvatida was recovered as paraphyletic.
• The earliest divergence split Velatida and Forcipulatacea from other extant lineages.
• Results were robust over a wide range of data partitions and alignment parameters.
Multi-locus phylogenetic studies of echinoderms based on Sanger and RNA-seq technologies and the fossil record have provided evidence for the Asterozoa-Echinozoa hypothesis. This hypothesis posits a sister relationship between asterozoan classes (Asteroidea and Ophiuroidea) and a similar relationship between echinozoan classes (Echinoidea and Holothuroidea). Despite this consensus around Asterozoa-Echinozoa, phylogenetic relationships within the class Asteroidea (sea stars or starfish) have been controversial for over a century. Open questions include relationships within asteroids and the status of the enigmatic taxon Xyloplax. Xyloplax is thought by some to represent a newly discovered sixth class of echinoderms, and by others to be an asteroid. To address these questions, we applied a novel workflow to a large RNA-seq dataset that encompassed a broad taxonomic and genomic sample. This study included 15 species sampled from all extant orders and 13 families, plus four ophiuroid species as an outgroup. To expand the taxonomic coverage, the study also incorporated five previously published transcriptomes and one previously published expressed sequence tag (EST) dataset. We developed and applied methods that used a range of alignment parameters with increasing permissiveness in terms of gap characters present within an alignment. This procedure facilitated the selection of phylogenomic data subsets from large amounts of transcriptome data. The results included 19 nested data subsets that ranged from 37 to 4,281 loci. Tree searches on all data subsets reconstructed Xyloplax as a velatid asteroid rather than a new class. This result implies that asteroid morphology remained labile well beyond the establishment of the body plan of the group. In the phylogenetic tree with the highest average asteroid nodal support, several monophyletic groups were recovered.
In this tree, Forcipulatida and Velatida are monophyletic and form a clade that includes Brisingida as sister to Forcipulatida. Xyloplax is consistently recovered as sister to Pteraster. Paxillosida and Spinulosida are each monophyletic, with Notomyotida as sister to the Paxillosida. Valvatida is recovered as paraphyletic. The results from other data subsets are largely consistent with these results. Our results support the hypothesis that the earliest divergence event among extant asteroids separated Velatida and Forcipulatacea from Valvatacea and Spinulosida.
An immense amount of observable diversity exists for all traits and across global populations. In the post-genomic era, equipped with efficient sequencing capabilities and better genotyping methods, we are now able to more fully appreciate how the regulation of gene expression follows from one's genotype in coding and non-coding DNA. The identification of genetic loci that contribute to quantifiable variation in gene expression is critical to further improving our understanding of the biological regulation of complex traits. Expression quantitative trait loci (eQTL) mapping studies have provided a powerful suite of techniques for genome-wide analysis to detect these regulatory effects. However, a typical eQTL analysis relies on a large number of samples with many genetic variants to achieve robust power and significance for detection. With this in mind, eQTL analysis brings about distinct computational and statistical challenges that require advanced methodological development to overcome. In recent years, many statistical and machine learning methods for eQTL analysis have been developed that provide a more complex perspective on the relationships between genetic variation and gene expression. In this chapter, we provide a comprehensive review of statistical and machine learning methods. We present various machine learning methods based upon regularization terms and several other statistical analysis methods. Finally, we discuss prior knowledge integration and hyperparameter optimization.
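The simplest form of the association test underlying an eQTL scan can be sketched as an ordinary least squares regression of one gene's expression on one variant's genotype dosage. This is only an illustrative baseline, not one of the regularized methods reviewed in the chapter; the function name `eqtl_slope` and the toy dosage and expression values are assumptions made up for the example.

```python
import numpy as np

def eqtl_slope(genotype, expression):
    """Fit expression = intercept + beta * dosage by ordinary least squares.

    Returns the slope (effect size per allele copy) and its t-statistic.
    Hypothetical helper for illustration only.
    """
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(expression, dtype=float)
    X = np.column_stack([np.ones_like(g), g])      # design matrix with intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # [intercept, beta]
    resid = y - X @ coef
    dof = len(y) - X.shape[1]                      # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)          # covariance of the coefficients
    se_beta = np.sqrt(cov[1, 1])
    return coef[1], coef[1] / se_beta

# Toy data (invented): dosages 0/1/2 and expression values for six samples.
toy_genotype = [0, 1, 2, 0, 1, 2]
toy_expression = [1.0, 3.1, 4.9, 1.1, 2.9, 5.0]
beta, t_stat = eqtl_slope(toy_genotype, toy_expression)
print(f"beta={beta:.3f}, t={t_stat:.1f}")
```

In a real study this test is repeated over many variant-gene pairs, which is where the sample size, multiple testing, and regularization concerns described above come in.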
Expression quantitative trait locus (eQTL) analysis is a powerful method for understanding the association between genetic variants and gene expression; it also has potential impact on the study of transcription medicine for complex human disease. Over the past two decades, researchers have focused on studying eQTLs, while growing evidence shows that regulatory genetic variants located in noncoding regions have a strong effect on gene expression. A growing number of researchers working on eQTL analysis recognize the importance of other types of QTLs beyond eQTLs. In this chapter, we explore some QTLs beyond eQTLs that show regulatory associations with eQTLs and explain the underlying links among these types of QTLs.
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome, 58 of which intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three- to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community, allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.
Short tandem repeats (STRs) and variable number tandem repeats (VNTRs) are important sources of natural and disease-causing variation, yet they have been problematic to resolve in reference genomes and genotype with short-read technology. We created a framework to model the evolution and instability of STRs and VNTRs in apes. We phased and assembled 3 ape genomes (chimpanzee, gorilla, and orangutan) using long-read and 10x Genomics linked-read sequence data for 21,442 human tandem repeats discovered in 6 haplotype-resolved assemblies of Yoruban, Chinese, and Puerto Rican origin. We define a set of 1,584 STRs/VNTRs expanded specifically in humans, including large tandem repeats affecting coding and noncoding portions of genes (e.g., MUC3A, CACNA1C). We show that short interspersed nuclear element–VNTR–Alu (SVA) retrotransposition is the main mechanism for distributing GC-rich human-specific tandem repeat expansions throughout the genome but with a bias against genes. In contrast, we observe that VNTRs not originating from retrotransposons have a propensity to cluster near genes, especially in the subtelomere. Using tissue-specific expression from human and chimpanzee brains, we identify genes where transcript isoform usage differs significantly, likely caused by cryptic splicing variation within VNTRs. Using single-cell expression from cerebral organoids, we observe a strong effect for genes associated with transcription profiles analogous to intermediate progenitor cells. Finally, we compare the sequence composition of some of the largest human-specific repeat expansions and identify 52 STRs/VNTRs with at least 40 uninterrupted pure tracts as candidates for genetically unstable regions associated with disease.
Three-dimensional spatial organization of chromosomes is defined by highly self-interacting regions 0.1-1 Mb in size termed topologically associating domains (TADs). Genetic factors that explain dynamic variation in TAD structure are not understood. We hypothesize that common structural variation (SV) in the human population can disrupt regulatory sequences and thereby influence TAD formation. To determine the effects of SVs on 3D chromatin organization, we performed chromosome conformation capture sequencing (Hi-C) of lymphoblastoid cell lines from 19 subjects for which SVs had been previously characterized in the 1000 Genomes Project. We tested the effects of common deletion polymorphisms on TAD structure by linear regression analysis of nearby quantitative chromatin interactions (contacts) within 240 kb of the deletion, and we specifically tested the hypothesis that deletions at TAD boundaries (TBs) could result in large-scale alterations in chromatin conformation.
Large (>10 kb) deletions had significant effects on long-range chromatin interactions. Deletions were associated with increased contacts that span the deleted region, and this effect was driven by large deletions that were not located within a TAD boundary (nonTB). Some deletions at TBs, including an 80 kb deletion of the genes CFHR1 and CFHR3, had detectable effects on chromatin contacts. However, for TB deletions overall, we did not detect a pattern of effects that was consistent in magnitude or direction. Large inversions in the population had a distinguishable signature characterized by a rearrangement of contacts that span their breakpoints.
Our study demonstrates that common SVs in the population impact long-range chromatin structure, and deletions and inversions have distinct signatures. However, the effects that we observe are subtle and variable between loci. Genome-wide analysis of chromatin conformation in large cohorts will be needed to quantify the influence of common SVs on chromatin structure.
The NAGLU challenge of the fourth edition of the Critical Assessment of Genome Interpretation experiment (CAGI4) in 2016 invited participants to predict the impact of variants of unknown significance (VUS) on the enzymatic activity of the lysosomal hydrolase α‐N‐acetylglucosaminidase (NAGLU). Deficiencies in NAGLU activity lead to a rare, monogenic, recessive lysosomal storage disorder, Sanfilippo syndrome type B (MPS type IIIB). This challenge attracted 17 submissions from 10 groups. We observed that top models were able to predict the impact of missense mutations on enzymatic activity with Pearson's correlation coefficients of up to 0.61. We also observed that top methods were significantly more correlated with each other than they were with observed enzymatic activity values, which we believe speaks to the importance of sequence conservation across the different methods. Improved functional predictions on the VUS will help population‐scale analysis of disease epidemiology and rare variant association analysis.
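The kind of comparison described above, correlating each method's predictions with measured activity and with other methods' predictions, can be sketched with a plain Pearson correlation. The function name `pearson_r` and all numeric values below are invented for illustration; they are not CAGI4 data and the sketch is not the assessors' actual procedure.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()                      # center both vectors
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Made-up relative enzymatic activities for four hypothetical variants.
observed = [0.10, 0.40, 0.50, 0.90]
method_a = [0.20, 0.30, 0.60, 0.70]        # hypothetical predictor A
method_b = [0.25, 0.35, 0.60, 0.75]        # hypothetical predictor B

print("A vs observed:", round(pearson_r(method_a, observed), 3))
print("B vs observed:", round(pearson_r(method_b, observed), 3))
print("A vs B:       ", round(pearson_r(method_a, method_b), 3))
```

Comparing the method-versus-method coefficient with the method-versus-observed coefficients is one simple way to see whether predictors agree with each other more than with the experiment.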
The human genome consists of over 3 billion nucleotides that have an average distance of 3.4 Angstroms between each base, which equates to over two meters of DNA contained within the 125 μm³ volume of a diploid cell nucleus. The dense compaction of chromatin by the supercoiling of DNA forms distinct architectural modules called topologically associated domains (TADs), which keep protein-coding genes, noncoding RNAs and epigenetic regulatory elements in close nuclear space. It has recently been shown that these conserved chromatin structures may contribute to tissue-specific gene expression through the encapsulation of genes and cis-regulatory elements, and mutations that affect TADs can lead to developmental disorders and some forms of cancer. At the population level, genomic structural variation contributes more to cumulative genetic difference than any other class of mutation, yet much remains to be studied as to how structural variation affects TADs. Here, we study the functional effects of structural variants (SVs) through the analysis of chromatin topology and gene activity for three trio families sampled from genetically diverse populations from the Human Genome Structural Variation Consortium. We then leverage clinically relevant recurrent genomic rearrangements in acute lymphoblastic leukemia and propose a machine learning approach to identify the rare Philadelphia-like subtype based on the gene activities within lymphoblastoid chromatin domains. This analysis suggests that TADs may improve our understanding of how SVs contribute to diverse gene expression patterns in health and disease.
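The "over two meters" figure in the opening sentence can be checked with a short calculation. The sketch assumes a round lower bound of 3.0 billion base pairs per haploid genome and two copies per diploid nucleus; the constant names are chosen for this example.

```python
# Back-of-the-envelope contour length of DNA in one diploid nucleus.
BASE_PAIRS_HAPLOID = 3.0e9   # assumed lower bound ("over 3 billion")
RISE_PER_BASE_M = 3.4e-10    # 3.4 Angstroms expressed in meters
GENOME_COPIES = 2            # diploid cell: two copies of the genome

length_m = BASE_PAIRS_HAPLOID * RISE_PER_BASE_M * GENOME_COPIES
print(f"{length_m:.2f} m")   # 2.04 m
```

Because the true haploid genome is somewhat larger than 3.0 billion base pairs, 2.04 m is a floor, consistent with the abstract's "over two meters".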