While major inroads have been made in identifying the genetic causes of rare Mendelian disorders, little progress has been made in the discovery of common gene variations that predispose to complex diseases. The single gene variants that have been shown to associate reproducibly with complex diseases typically have small effect sizes or attributable risks. However, the joint actions of common gene variants within pathways may play a major role in predisposing to complex diseases (the paradigm of complex genetics). The goal of this study was to determine whether polymorphism in a candidate pathway (axon guidance) predisposed to a complex disease (Parkinson disease [PD]). We mined a whole-genome association dataset and identified single nucleotide polymorphisms (SNPs) that were within axon-guidance pathway genes. We then constructed models of axon-guidance pathway SNPs that predicted three outcomes: PD susceptibility (odds ratio = 90.8, p = 4.64 × 10^−38), survival free of PD (hazard ratio = 19.0, p = 5.43 × 10^−48), and PD age at onset (R^2 = 0.68, p = 1.68 × 10^−51). By contrast, models constructed from thousands of random selections of genomic SNPs predicted the three PD outcomes poorly. Mining of a second whole-genome association dataset and mining of an expression profiling dataset also supported a role for many axon-guidance pathway genes in PD. These findings could have important implications regarding the pathogenesis of PD. This genomic pathway approach may also offer insights into other complex diseases such as Alzheimer disease, diabetes mellitus, nicotine and alcohol dependence, and several cancers.
Objectives: We sought to identify a novel gene for dilated cardiomyopathy (DCM). Background: DCM is a heritable, genetically heterogeneous disorder that remains idiopathic in the majority of patients. Familial cases provide an opportunity to discover unsuspected molecular bases of DCM, enabling pre-clinical risk detection. Methods: Two large families with autosomal-dominant DCM were studied. Genome-wide linkage analysis was used to identify a disease locus, followed by fine mapping and positional candidate gene sequencing. Mutation scanning was then performed in 278 unrelated subjects with idiopathic DCM, prospectively identified at the Mayo Clinic. Results: Overlapping loci for DCM were independently mapped to chromosome 10q25-q26. Deoxyribonucleic acid sequencing of affected individuals in each family revealed distinct heterozygous missense mutations in exon 9 of RBM20, encoding ribonucleic acid (RNA) binding motif protein 20. Comprehensive coding sequence analyses identified missense mutations clustered within this same exon in 6 additional DCM families. Mutations segregated with DCM (peak composite logarithm of the odds score >11.49), were absent in 480 control samples, and altered residues within a highly conserved arginine/serine (RS)-rich region. Expression of RBM20 messenger RNA was confirmed in human heart tissue. Conclusions: Our findings establish RBM20 as a DCM gene and reveal a mutation hotspot in the RS domain. RBM20 is preferentially expressed in the heart and encodes motifs prototypical of spliceosome proteins that regulate alternative pre-messenger RNA splicing, thus implicating a functionally distinct gene in human cardiomyopathy. RBM20 mutations are associated with young age at diagnosis, end-stage heart failure, and high mortality.
Objective: We compared the clinical outcomes and changes in pulmonary function test (PFT) results after segmentectomy or lobectomy for non–small cell lung cancer. Methods: The retrospective study included 212 patients who had undergone segmentectomy (group S) and 2336 patients who had undergone lobectomy (group L) from 1997 to 2012. The follow-up and medical record data were collected. We used all the longitudinal PFT data within 24 months postoperatively and performed linear mixed modeling. We analyzed the 5-year overall and disease-free survival in stage IA patients. We used propensity score case matching to minimize the bias due to imbalanced group comparisons. Results: During the perioperative period, 1 death (0.4%) in group S and 7 (0.3%) in group L occurred. The hospital stay for the 2 groups was similar (median, 5.0 vs 5.0 days; range, 2-99 vs 2-58). The mean overall and disease-free survival period of those with T1a after segmentectomy or lobectomy seemed to be similar (4.2 vs 4.5 years, P = .06; and 4.1 vs 4.4 years, P = .07, respectively). Compared with segmentectomy, lobectomy yielded marginally significantly better overall (4.4 vs 3.9 years, P = .05) and disease-free (4.1 vs 3.6 years; P = .05) survival in those with T1b. We did not find a significantly different effect on the PFTs after segmentectomy or lobectomy. Conclusions: Both surgical types were safe. We would advocate lobectomy for patients with stage IA disease, especially those with T1b. A retrospective study with a large sample size and more detailed information should be conducted for PFT evaluation, with additional stratification by lobe and laterality.
Background: Lung cancer in individuals who have never smoked tobacco products is an increasing medical and public-health issue. We aimed to unravel the genetic basis of lung cancer in never smokers. Methods: We did a four-stage investigation. First, a genome-wide association study of single nucleotide polymorphisms (SNPs) was done with 754 never smokers (377 matched case-control pairs at Mayo Clinic, Rochester, MN, USA). Second, the top candidate SNPs from the first study were validated in two independent studies among 735 (MD Anderson Cancer Center, Houston, TX, USA) and 253 (Harvard University, Boston, MA, USA) never smokers. Third, further replication of the top SNP was done in 530 never smokers (UCLA, Los Angeles, CA, USA). Fourth, expression quantitative trait loci (eQTL) and gene-expression differences were analysed to further elucidate the causal relation between the validated SNPs and the risk of lung cancer in never smokers. Findings: 44 top candidate SNPs were identified that might alter the risk of lung cancer in never smokers. rs2352028 at chromosome 13q31.3 was subsequently replicated with an additive genetic model in the four independent studies, with a combined odds ratio of 1·46 (95% CI 1·26–1·70, p=5·94×10^−6). A cis eQTL analysis showed there was a strong correlation between genotypes of the replicated SNPs and the transcription level of the gene GPC5 in normal lung tissues (p=1·96×10^−4), with the high-risk allele linked with lower expression. Additionally, the transcription level of GPC5 in normal lung tissue was twice that detected in matched lung adenocarcinoma tissue (p=6·75×10^−11). Interpretation: Genetic variants at 13q31.3 alter the expression of GPC5, and are associated with susceptibility to lung cancer in never smokers. Downregulation of GPC5 might contribute to the development of lung cancer in never smokers. Funding: US National Institutes of Health; Mayo Foundation.
Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ~50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD by applying a natural language processing (NLP) technique to multiple electronic health record (EHR) data sources covering 91,166 multi-ancestry participants.
We developed an NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes.
Our algorithm significantly improved patient classification performance for DD analysis (algorithm PPVs ≥ 0.94), with up to a 3.5-fold increase in the number of identified patients compared with the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis in the identified subjects replicated the well-established associations between ARHGAP15 loci and DD, with overall stronger GWAS signals in diverticulitis patients than in diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes.
As the first multi-ancestry GWAS-PheWAS study, we showed that heterogeneous EHR data can be mapped through an integrative analytical pipeline to reveal significant genotype-phenotype associations with clinical interpretation.
A systematic framework for processing unstructured EHR data with NLP could advance deep, scalable phenotyping for better patient identification and facilitate etiological investigation of a disease with multilayered data.
We performed a two-tiered, whole-genome association study of Parkinson disease (PD). For tier 1, we individually genotyped 198,345 uniformly spaced and informative single-nucleotide polymorphisms (SNPs) in 443 sibling pairs discordant for PD. For tier 2a, we individually genotyped 1,793 PD-associated SNPs (P<.01 in tier 1) and 300 genomic control SNPs in 332 matched case–unrelated control pairs. We identified 11 SNPs that were associated with PD (P<.01) in both tier 1 and tier 2 samples and had the same direction of effect. For these SNPs, we combined data from the case–unaffected sibling pair (tier 1) and case–unrelated control pair (tier 2) samples and employed a liberalization of the sibling transmission/disequilibrium test to calculate odds ratios, 95% confidence intervals, and P values. A SNP within the semaphorin 5A gene (SEMA5A) had the lowest combined P value (P=7.62×10^−6). The protein encoded by this gene plays an important role in neurogenesis and in neuronal apoptosis, which is consistent with existing hypotheses regarding PD pathogenesis. A second SNP tagged the PARK11 late-onset PD susceptibility locus (P=1.70×10^−5). In tier 2b, we also selected for genotyping additional SNPs that were borderline significant (P<.05) in tier 1 but that tested a priori biological and genetic hypotheses regarding susceptibility to PD (n=941 SNPs). In analysis of the combined tier 1 and tier 2b data, the two SNPs with the lowest P values (P=9.07×10^−6; P=2.96×10^−5) tagged the PARK10 late-onset PD susceptibility locus. Independent replication across populations will clarify the role of the genomic loci tagged by these SNPs in conferring PD susceptibility.
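The two-tier selection rule described above (P<.01 in both tiers, same direction of effect) can be illustrated with a minimal Python sketch. The `SnpResult` record, the function name, and the example SNP IDs and values are all hypothetical and are not taken from the study; direction of effect is assumed here to be read from whether the odds ratio lies above or below 1.

```python
from dataclasses import dataclass


@dataclass
class SnpResult:
    """Hypothetical per-SNP association result from one tier."""
    snp_id: str
    p_value: float
    odds_ratio: float


def replicated_snps(tier1, tier2, alpha=0.01):
    """Return IDs of SNPs with P < alpha in both tiers and the same direction of effect."""
    tier2_by_id = {r.snp_id: r for r in tier2}
    kept = []
    for r1 in tier1:
        r2 = tier2_by_id.get(r1.snp_id)
        if r2 is None:
            continue  # SNP was not genotyped in tier 2
        # Assumption: direction of effect is the side of 1 the odds ratio falls on.
        same_direction = (r1.odds_ratio > 1) == (r2.odds_ratio > 1)
        if r1.p_value < alpha and r2.p_value < alpha and same_direction:
            kept.append(r1.snp_id)
    return kept


# Made-up example data (not from the study).
tier1 = [
    SnpResult("snpA", 0.004, 1.6),
    SnpResult("snpB", 0.002, 1.4),
    SnpResult("snpC", 0.008, 0.7),
]
tier2 = [
    SnpResult("snpA", 0.009, 1.3),   # replicates, same direction
    SnpResult("snpB", 0.003, 0.8),   # significant but opposite direction
    SnpResult("snpC", 0.200, 0.75),  # same direction but not significant
]
print(replicated_snps(tier1, tier2))
```

Only SNPs that pass both filters would then go on to the combined analysis.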
Sex differences in incidence and prevalence of and morbidity and mortality from cardiovascular disease are well documented. However, many studies examining the genetic basis for cardiovascular disease fail to consider sex as a variable in the study design, in part because there is an inherent difficulty in studying the contribution of the sex chromosomes in women due to X chromosome inactivation. This paper will provide general background on the X and Y chromosomes (including gene content, the pseudoautosomal regions, and X chromosome inactivation), discuss how sex chromosomes have been ignored in Genome-wide Association Studies (GWAS) of cardiovascular diseases, and discuss genetics influencing development of cardiovascular risk factors and atherosclerosis, with particular attention to carotid intima-medial thickness and coronary arterial calcification, based on sex-specific studies. In addition, the paper will briefly consider how ethnicity and hormonal status act as confounding variables in sex-based analysis, along with statistical methods to account for sex in cardiovascular disease.
Platelets are enucleated cell fragments derived from megakaryocytes that play key roles in hemostasis and in the pathogenesis of atherothrombosis and cancer. Platelet traits are highly heritable and identification of genetic variants associated with platelet traits and assessing their pleiotropic effects may help to understand the role of underlying biological pathways. We conducted an electronic medical record (EMR)-based study to identify common variants that influence inter-individual variation in the number of circulating platelets (PLT) and mean platelet volume (MPV), by performing a genome-wide association study (GWAS). We characterized genetic variants associated with MPV and PLT using functional, pathway and disease enrichment analyses; we assessed pleiotropic effects of such variants by performing a phenome-wide association study (PheWAS) with a wide range of EMR-derived phenotypes. A total of 13,582 participants in the electronic MEdical Records and GEnomic network had data for PLT and 6,291 participants had data for MPV. We identified five chromosomal regions associated with PLT and eight associated with MPV at genome-wide significance (P < 5E−8). In addition, we replicated 20 SNPs out of 56 SNPs (α: 0.05/56 = 9E−4) influencing PLT and 22 SNPs out of 29 SNPs (α: 0.05/29 = 2E−3) influencing MPV in a published meta-analysis of GWAS of PLT and MPV. While our GWAS did not find any new associations, our functional analyses revealed that genes in these regions influence thrombopoiesis and encode kinases, membrane proteins, proteins involved in cellular trafficking, transcription factors, proteasome complex subunits, proteins of signal transduction pathways, proteins involved in megakaryocyte development, and platelet production and hemostasis. PheWAS using a single-SNP Bonferroni correction for 1,368 diagnoses (0.05/1368 = 3.6E−5) revealed that several variants in these genes have pleiotropic associations with myocardial infarction, autoimmune, and hematologic disorders. We conclude that multiple genetic loci influence interindividual variation in platelet traits and also have significant pleiotropic effects; the related genes are in multiple functional pathways including those relevant to thrombopoiesis.
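The Bonferroni thresholds quoted in this abstract are each a family-wise α of 0.05 divided by the number of tests. A short Python sketch of that arithmetic follows; the function names are illustrative only, and the abstract reports the resulting values in rounded form.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True if p_value passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Test counts taken from the abstract; alpha = 0.05 throughout.
thresholds = {
    "PLT replication (56 SNPs)": bonferroni_threshold(0.05, 56),
    "MPV replication (29 SNPs)": bonferroni_threshold(0.05, 29),
    "PheWAS (1,368 diagnoses)": bonferroni_threshold(0.05, 1368),
}
for label, threshold in thresholds.items():
    print(f"{label}: {threshold:.2e}")
```

The correction is conservative when tests are correlated, which is one reason replication and PheWAS analyses often state the exact divisor used.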
To provide a validated method to confidently identify exon-containing copy-number variants (CNVs), with a low false discovery rate (FDR), in targeted sequencing data from a clinical laboratory, with particular focus on single-exon CNVs.
DNA sequence coverage data are normalized within each sample and subsequently exonic CNVs are identified in a batch of samples, when the target log2 ratio of the sample to the batch median exceeds defined thresholds. The quality of exonic CNV calls is assessed by C-scores (Z-like scores) using thresholds derived from gold standard samples and simulation studies. We integrate an ExonQC threshold to lower FDR and compare performance with alternate software (VisCap).
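The core calling step described above (within-sample normalization, then a log2 ratio of each target against the batch median) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Atlas-CNV implementation: median scaling is assumed for the within-sample normalization, the loss/gain thresholds are hypothetical placeholders, coverage is assumed to be nonzero, and the C-score and ExonQC steps are omitted because the abstract does not define them.

```python
import math
from statistics import median


def normalize_within_sample(coverage):
    """Scale each target's coverage by the sample's median coverage (assumed method)."""
    sample_median = median(coverage.values())
    return {target: depth / sample_median for target, depth in coverage.items()}


def log2_ratios(batch):
    """Map {sample: {target: coverage}} to {sample: {target: log2(sample / batch median)}}."""
    normalized = {s: normalize_within_sample(cov) for s, cov in batch.items()}
    targets = list(next(iter(normalized.values())))
    batch_median = {t: median(n[t] for n in normalized.values()) for t in targets}
    return {
        s: {t: math.log2(n[t] / batch_median[t]) for t in targets}
        for s, n in normalized.items()
    }


def call_cnvs(ratios, loss=-0.6, gain=0.4):
    """Flag targets whose log2 ratio crosses hypothetical loss/gain thresholds."""
    calls = []
    for sample, per_target in ratios.items():
        for target, ratio in per_target.items():
            if ratio <= loss:
                calls.append((sample, target, "loss"))
            elif ratio >= gain:
                calls.append((sample, target, "gain"))
    return calls


# Made-up batch: sample s5 has half the expected coverage at exon e2.
batch = {f"s{i}": {"e1": 100, "e2": 100, "e3": 100} for i in range(1, 5)}
batch["s5"] = {"e1": 100, "e2": 50, "e3": 100}
print(call_cnvs(log2_ratios(batch)))
```

Using the batch median as the reference keeps a single aberrant sample from shifting the baseline, which matters most for single-exon events.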
Thirteen CNVs were used as a truth set to validate Atlas-CNV and to compare it with VisCap. We demonstrated FDR reduction in validation, simulation, and 10,926 eMERGESeq samples without sensitivity loss. Sixty-four multiexon and 29 single-exon CNVs with high C-scores were assessed by Multiplex Ligation-dependent Probe Amplification (MLPA).
Atlas-CNV is validated as a method to identify exonic CNVs in targeted sequencing data generated in the clinical laboratory. The ExonQC and C-score assignment can reduce FDR (identification of targets with high variance) and improve calling accuracy of single-exon CNVs, respectively. We propose guidelines and criteria to identify high confidence single-exon CNVs.
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R^2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R^2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
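To make the allelic R^2 metric concrete, the Python sketch below computes a squared Pearson correlation between imputed allele dosages and known genotypes (coded 0, 1, 2), together with the minor allele frequency. It assumes true genotypes are available, for example from masked sites; imputation software instead estimates this quantity without the truth, so this is an illustration of the concept and not the BEAGLE or IMPUTE2 calculation. Function names and data are hypothetical.

```python
import math


def squared_correlation(imputed, truth):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(imputed)
    if n != len(truth) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(imputed) / n
    mean_y = sum(truth) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, truth))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in truth)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a monomorphic site
    return sxy ** 2 / (sxx * syy)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from genotypes coded as alternate-allele counts (0, 1, 2)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


# Made-up genotypes and dosages for a single SNP in five individuals.
truth = [0, 1, 2, 1, 0]
imputed = [0.1, 0.9, 1.8, 1.2, 0.0]
print(round(squared_correlation(imputed, truth), 3), minor_allele_frequency(truth))
```

Computing both values per SNP makes it possible to examine how imputation quality varies with minor allele frequency, the relationship the abstract lists among its evaluation metrics.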