The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS Catalog) provides a publicly available, manually curated collection of published GWAS assaying at least 100,000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10⁻⁵. The Catalog includes 1,751 curated publications of 11,912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs' chromosomal locations. The Catalog can be accessed via a tabular web interface, via a dynamic visualization on the human karyotype, as a downloadable tab-delimited file and as an OWL knowledge base. This article presents a number of recent improvements to the Catalog, including novel ways for users to interact with the Catalog and changes to the curation infrastructure.
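As a minimal sketch of working with the tab-delimited download, the snippet below keeps only associations under the stated P threshold. The column names (`SNPS`, `DISEASE/TRAIT`, `P-VALUE`) are assumptions about the file header, and the sample rows are invented for illustration; both should be checked against the actual download.

```python
import csv
import io

P_THRESHOLD = 1e-5  # inclusion threshold stated in the abstract


def load_associations(handle, threshold=P_THRESHOLD):
    """Yield (snp, trait, p) for rows whose p-value is below the threshold.

    Column names are assumed, not taken from the abstract.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        try:
            p = float(row["P-VALUE"])
        except (KeyError, ValueError):
            continue  # skip rows with a missing or non-numeric p-value
        if p < threshold:
            yield row["SNPS"], row["DISEASE/TRAIT"], p


# Invented rows for illustration only; these are not real Catalog entries.
SAMPLE = (
    "SNPS\tDISEASE/TRAIT\tP-VALUE\n"
    "rs0000001\tExample trait A\t3E-8\n"
    "rs0000002\tExample trait B\t2E-4\n"
    "rs0000003\tExample trait A\tNR\n"
)

hits = list(load_associations(io.StringIO(SAMPLE)))
```

To read a real file, the same function can be passed an open file handle instead of the in-memory sample.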
Facial morphology, a conspicuous feature of human appearance, is highly heritable. Previous studies on the genetic basis of facial morphology were performed mainly in European-ancestry cohorts (EUR). Applying a data-driven phenotyping and multivariate genome-wide scanning protocol to a large collection of three-dimensional facial images of individuals with East Asian ancestry (EAS), we identified 244 variants in 166 loci (62 new) associated with typical-range facial variation. A newly proposed polygenic shape analysis indicates that the effects of the variants on facial shape in EAS can be generalized to EUR. Based on this, we further identified 13 variants related to differences between facial shape in EUR and EAS populations. Evolutionary analyses suggest that the difference in nose shape between EUR and EAS populations is caused by directional selection, due mainly to local adaptation in Europeans. Our results illustrate the underlying genetic basis for facial differences across populations.
Transcription factor (TF) DNA sequence preferences direct their regulatory activity, but are currently known for only ∼1% of eukaryotic TFs. Broadly sampling DNA-binding domain (DBD) types from multiple eukaryotic clades, we determined DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. We find that closely related DBDs almost always have very similar DNA sequence preferences, enabling inference of motifs for ∼34% of the ∼170,000 known or predicted eukaryotic TFs. Sequences matching both measured and inferred motifs are enriched in chromatin immunoprecipitation sequencing (ChIP-seq) peaks and upstream of transcription start sites in diverse eukaryotic lineages. SNPs defining expression quantitative trait loci in Arabidopsis promoters are also enriched for predicted TF binding sites. Importantly, our motif "library" can be used to identify specific TFs whose binding may be altered by human disease risk alleles. These data present a powerful resource for mapping transcriptional networks across eukaryotes.
Whole-genome sequences are now available for many microbial species and clades; however, existing whole-genome alignment methods are limited in their ability to perform sequence comparisons of multiple sequences simultaneously. Here we present the Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Together they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.
The majority of studies of genetic association with disease have been performed in Europeans. This European bias has important implications for risk prediction of diseases across global populations. In this commentary, we justify the need to study more diverse populations using both empirical examples and theoretical reasoning.
Abstract
Motivation
Current methods for genotype imputation and phasing exploit the volume of data in haplotype reference panels and rely on hidden Markov models (HMMs). Existing programs all have essentially the same imputation accuracy, are computationally intensive and generally require prephasing the typed markers.
Results
We introduce a novel data-mining method for genotype imputation and phasing that substitutes highly efficient linear algebra routines for HMM calculations. This strategy, embodied in our Julia program MendelImpute.jl, avoids explicit assumptions about recombination and population structure while delivering similar prediction accuracy, better memory usage and an order of magnitude or better run-times compared to the fastest competing method. MendelImpute operates on both dosage data and unphased genotype data and simultaneously imputes missing genotypes and phase at both the typed and untyped SNPs (single nucleotide polymorphisms). Finally, MendelImpute naturally extends to global and local ancestry estimation and lends itself to new strategies for data compression and hence faster data transport and sharing.
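To illustrate how linear algebra can stand in for HMM calculations, here is a minimal sketch that picks the pair of reference haplotypes whose sum best matches an unphased genotype vector in the least-squares sense, then fills untyped markers from that pair. This is an illustrative assumption about the general approach, written in Python rather than Julia, and is not taken from the MendelImpute.jl source; the function names and the toy reference panel are hypothetical.

```python
import numpy as np


def best_haplotype_pair(x, H):
    """Return (i, j, cost) minimizing ||x - H[:, i] - H[:, j]||^2.

    x: genotype vector (0/1/2) over typed markers.
    H: markers x haplotypes reference matrix with 0/1 entries.
    """
    M = H.T @ H          # Gram matrix of haplotype inner products h_i . h_j
    N = x @ H            # inner products x . h_i
    norms = np.diag(M)   # squared norms ||h_i||^2
    # Expand the squared norm so every pair is scored by matrix operations:
    # ||x||^2 + ||h_i||^2 + ||h_j||^2 - 2 x.h_i - 2 x.h_j + 2 h_i.h_j
    cost = (x @ x) + norms[:, None] + norms[None, :] \
        - 2 * N[:, None] - 2 * N[None, :] + 2 * M
    i, j = np.unravel_index(np.argmin(cost), cost.shape)
    return int(i), int(j), float(cost[i, j])


def impute(x, H):
    """Fill NaN entries of x using the best-matching haplotype pair."""
    typed = ~np.isnan(x)
    i, j, _ = best_haplotype_pair(x[typed], H[typed])
    filled = x.copy()
    filled[~typed] = H[~typed, i] + H[~typed, j]
    return filled


# Toy reference panel: 5 markers (rows) x 3 haplotypes (columns).
H = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
], dtype=float)

# Genotype equal to haplotype 0 + haplotype 2, with the last marker untyped.
x = np.array([1, 1, 1, 1, np.nan])
imputed = impute(x, H)
```

The sketch scores all pairs at once for a single sample and a single window; a practical method would also need to handle many samples, genomic windows and phase switches between windows.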
Availability and implementation
Software, documentation and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelImpute.jl.
Supplementary information
Supplementary data are available at Bioinformatics online.
The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.
Both short and long sleep are associated with an adverse lipid profile, likely through different biological pathways. To elucidate the biology of sleep-associated adverse lipid profile, we conduct multi-ancestry genome-wide sleep-SNP interaction analyses on three lipid traits (HDL-c, LDL-c and triglycerides). In the total study sample (discovery + replication) of 126,926 individuals from 5 different ancestry groups, when considering either long or short total sleep time interactions in joint analyses, we identify 49 previously unreported lipid loci, and 10 additional previously unreported lipid loci in a restricted sample of European-ancestry cohorts. In addition, we identify new gene-sleep interactions for known lipid loci such as LPL and PCSK9. The previously unreported lipid loci have a modest explained variance in lipid levels: most notably, gene-short-sleep interactions explain 4.25% of the variance in triglyceride level. Collectively, these findings contribute to our understanding of the biological mechanisms involved in sleep-associated adverse lipid profiles.
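A sleep-SNP interaction analysis of this kind is commonly framed as a regression with a product term. The sketch below fits lipid = b0 + bG·G + bE·E + bGE·G·E by ordinary least squares on simulated data. The model form, the variable names, the effect sizes and the simulated data are all assumptions for illustration; the abstract does not specify the model or covariates used.

```python
import numpy as np


def fit_interaction_model(G, E, y):
    """Least-squares fit of y ~ 1 + G + E + G*E.

    Returns coefficients in the order [intercept, G, E, G*E].
    """
    X = np.column_stack([np.ones_like(G), G, E, G * E])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat


# Simulated data only; the effect sizes below are arbitrary.
rng = np.random.default_rng(0)
n = 5000
G = rng.binomial(2, 0.3, size=n).astype(float)  # SNP dosage coded 0/1/2
E = rng.binomial(1, 0.2, size=n).astype(float)  # 1 = short sleeper (assumed coding)
TRUE_BETA = np.array([1.0, 0.2, -0.1, 0.5])
y = (TRUE_BETA[0] + TRUE_BETA[1] * G + TRUE_BETA[2] * E
     + TRUE_BETA[3] * G * E + rng.normal(0.0, 0.1, size=n))

beta_hat = fit_interaction_model(G, E, y)
```

In this framing, the last coefficient captures the interaction alone, while a joint test would assess the SNP main effect and the interaction together.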
Genetic differences between Arabidopsis thaliana accessions underlie the plant's extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.