Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of ...SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
Abstract
Prediction of disease risk is an essential part of preventative medicine, often guiding clinical management. Risk prediction typically includes risk factors such as age, sex, family history ...of disease and lifestyle (e.g. smoking status); however, in recent years, there has been increasing interest to include genomic information into risk models. Polygenic risk scores (PRS) aggregate the effects of many genetic variants across the human genome into a single score and have recently been shown to have predictive value for multiple common diseases. In this review, we summarize the potential use cases for seven common diseases (breast cancer, prostate cancer, coronary artery disease, obesity, type 1 diabetes, type 2 diabetes and Alzheimer’s disease) where PRS has or could have clinical utility. PRS analysis for these diseases frequently revolved around (i) risk prediction performance of a PRS alone and in combination with other non-genetic risk factors, (ii) estimation of lifetime risk trajectories, (iii) the independent information of PRS and family history of disease or monogenic mutations and (iv) estimation of the value of adding a PRS to specific clinical risk prediction scenarios. We summarize open questions regarding PRS usability, ancestry bias and transferability, emphasizing the need for the next wave of studies to focus on the implementation and health-economic value of PRS testing. In conclusion, it is becoming clear that PRS have value in disease risk prediction and there are multiple areas where this may have clinical utility.
Recent genome-wide association studies in stroke have enabled the generation of genomic risk scores (GRS) but their predictive power has been modest compared to established stroke risk factors. Here, ...using a meta-scoring approach, we develop a metaGRS for ischaemic stroke (IS) and analyse this score in the UK Biobank (n = 395,393; 3075 IS events by age 75). The metaGRS hazard ratio for IS (1.26, 95% CI 1.22-1.31 per metaGRS standard deviation) doubles that of a previous GRS, identifying a subset of individuals at monogenic levels of risk: the top 0.25% of metaGRS have three-fold risk of IS. The metaGRS is similarly or more predictive compared to several risk factors, such as family history, blood pressure, body mass index, and smoking. We estimate the reductions needed in modifiable risk factors for individuals with different levels of genomic risk and suggest that, for individuals with high metaGRS, achieving risk factor levels recommended by current guidelines may be insufficient to mitigate risk.
Genetics plays an important role in coronary heart disease (CHD) but the clinical utility of genomic risk scores (GRSs) relative to clinical risk scores, such as the Framingham Risk Score (FRS), is ...unclear. Our aim was to construct and externally validate a CHD GRS, in terms of lifetime CHD risk and relative to traditional clinical risk scores.
We generated a GRS of 49 310 SNPs based on a CARDIoGRAMplusC4D Consortium meta-analysis of CHD, then independently tested it using five prospective population cohorts (three FINRISK cohorts, combined n = 12 676, 757 incident CHD events; two Framingham Heart Study cohorts (FHS), combined n = 3406, 587 incident CHD events). The GRS was associated with incident CHD (FINRISK HR = 1.74, 95% confidence interval (CI) 1.61-1.86 per S.D. of GRS; Framingham HR = 1.28, 95% CI 1.18-1.38), and was largely unchanged by adjustment for known risk factors, including family history. Integration of the GRS with the FRS or ACC/AHA13 scores improved the 10 years risk prediction (meta-analysis C-index: +1.5-1.6%, P < 0.001), particularly for individuals ≥60 years old (meta-analysis C-index: +4.6-5.1%, P < 0.001). Importantly, the GRS captured substantially different trajectories of absolute risk, with men in the top 20% of attaining 10% cumulative CHD risk 12-18 y earlier than those in the bottom 20%. High genomic risk was partially compensated for by low systolic blood pressure, low cholesterol level, and non-smoking.
A GRS based on a large number of SNPs improves CHD risk prediction and encodes different trajectories of lifetime risk not captured by traditional clinical risk scores.
A central goal of medical genetics is to accurately predict complex disease from genotypes. Here, we present a comprehensive analysis of simulated and real data using lasso and elastic‐net penalized ...support‐vector machine models, a mixed‐effects linear model, a polygenic score, and unpenalized logistic regression. In simulation, the sparse penalized models achieved lower false‐positive rates and higher precision than the other methods for detecting causal SNPs. The common practice of prefiltering SNP lists for subsequent penalized modeling was examined and shown to substantially reduce the ability to recover the causal SNPs. Using genome‐wide SNP profiles across eight complex diseases within cross‐validation, lasso and elastic‐net models achieved substantially better predictive ability in celiac disease, type 1 diabetes, and Crohn's disease, and had equivalent predictive ability in the rest, with the results in celiac disease strongly replicating between independent datasets. We investigated the effect of linkage disequilibrium on the predictive models, showing that the penalized methods leverage this information to their advantage, compared with methods that assume SNP independence. Our findings show that sparse penalized approaches are robust across different disease architectures, producing as good as or better phenotype predictions and variance explained. This has fundamental ramifications for the selection and future development of methods to genetically predict human disease.
Polygenic risk scores (PRSs) can stratify populations into cardiovascular disease (CVD) risk groups. We aimed to quantify the potential advantage of adding information on PRSs to conventional risk ...factors in the primary prevention of CVD.
Using data from UK Biobank on 306,654 individuals without a history of CVD and not on lipid-lowering treatments (mean age SD: 56.0 8.0 years; females: 57%; median follow-up: 8.1 years), we calculated measures of risk discrimination and reclassification upon addition of PRSs to risk factors in a conventional risk prediction model (i.e., age, sex, systolic blood pressure, smoking status, history of diabetes, and total and high-density lipoprotein cholesterol). We then modelled the implications of initiating guideline-recommended statin therapy in a primary care setting using incidence rates from 2.1 million individuals from the Clinical Practice Research Datalink. The C-index, a measure of risk discrimination, was 0.710 (95% CI 0.703-0.717) for a CVD prediction model containing conventional risk predictors alone. Addition of information on PRSs increased the C-index by 0.012 (95% CI 0.009-0.015), and resulted in continuous net reclassification improvements of about 10% and 12% in cases and non-cases, respectively. If a PRS were assessed in the entire UK primary care population aged 40-75 years, assuming that statin therapy would be initiated in accordance with the UK National Institute for Health and Care Excellence guidelines (i.e., for persons with a predicted risk of ≥10% and for those with certain other risk factors, such as diabetes, irrespective of their 10-year predicted risk), then it could help prevent 1 additional CVD event for approximately every 5,750 individuals screened. By contrast, targeted assessment only among people at intermediate (i.e., 5% to <10%) 10-year CVD risk could help prevent 1 additional CVD event for approximately every 340 individuals screened. Such a targeted strategy could help prevent 7% more CVD events than conventional risk prediction alone. Potential gains afforded by assessment of PRSs on top of conventional risk factors would be about 1.5-fold greater than those provided by assessment of C-reactive protein, a plasma biomarker included in some risk prediction guidelines. Potential limitations of this study include its restriction to European ancestry participants and a lack of health economic evaluation.
Our results suggest that addition of PRSs to conventional risk factors can modestly enhance prediction of first-onset CVD and could translate into population health benefits if used at scale.
Juvenile idiopathic arthritis (JIA) is an autoimmune disease and a common cause of chronic disability in children. Diagnosis of JIA is based purely on clinical symptoms, which can be variable, ...leading to diagnosis and treatment delays. Despite JIA having substantial heritability, the construction of genomic risk scores (GRSs) to aid or expedite diagnosis has not been assessed. Here, we generate GRSs for JIA and its subtypes and evaluate their performance.
We examined three case/control cohorts (UK, US-based and Australia) with genome-wide single nucleotide polymorphism (SNP) genotypes. We trained GRSs for JIA and its subtypes using lasso-penalised linear models in cross-validation on the UK cohort, and externally tested it in the other cohorts.
The JIA GRS alone achieved cross-validated area under the receiver operating characteristic curve (AUC)=0.670 in the UK cohort and externally-validated AUCs of 0.657 and 0.671 in the US-based and Australian cohorts, respectively. In logistic regression of case/control status, the corresponding odds ratios (ORs) per standard deviation (SD) of GRS were 1.831 (1.685 to 1.991) and 2.008 (1.731 to 2.345), and were unattenuated by adjustment for sex or the top 10 genetic principal components. Extending our analysis to JIA subtypes revealed that the enthesitis-related JIA had both the longest time-to-referral and the subtype GRS with the strongest predictive capacity overall across data sets: AUCs 0.82 in UK; 0.84 in Australian; and 0.70 in US-based. The particularly common oligoarthritis JIA also had a GRS that outperformed those for JIA overall, with AUCs of 0.72, 0.74 and 0.77, respectively.
A GRS for JIA has potential to augment clinical JIA diagnosis protocols, prioritising higher-risk individuals for follow-up and treatment. Consistent with JIA heterogeneity, subtype-specific GRSs showed particularly high performance for enthesitis-related and oligoarthritis JIA.
A central goal of genomics is to predict phenotypic variation from genetic variation. Fitting predictive models to genome-wide and whole genome single nucleotide polymorphism (SNP) profiles allows us ...to estimate the predictive power of the SNPs and potentially develop diagnostic models for disease. However, many current datasets cannot be analysed with standard tools due to their large size.
We introduce SparSNP, a tool for fitting lasso linear models for massive SNP datasets quickly and with very low memory requirements. In analysis on a large celiac disease case/control dataset, we show that SparSNP runs substantially faster than four other state-of-the-art tools for fitting large scale penalised models. SparSNP was one of only two tools that could successfully fit models to the entire celiac disease dataset, and it did so with superior performance. Compared with the other tools, the models generated by SparSNP had better than or equal to predictive performance in cross-validation.
Genomic datasets are rapidly increasing in size, rendering existing approaches to model fitting impractical due to their prohibitive time or memory requirements. This study shows that SparSNP is an essential addition to the genomic analysis toolkit.SparSNP is available at http://www.genomics.csse.unimelb.edu.au/SparSNP.
Traditional genome-wide scans for positive selection have mainly uncovered selective sweeps associated with monogenic traits. While selection on quantitative traits is much more common, very few ...signals have been detected because of their polygenic nature. We searched for positive selection signals underlying coronary artery disease (CAD) in worldwide populations, using novel approaches to quantify relationships between polygenic selection signals and CAD genetic risk. We identified new candidate adaptive loci that appear to have been directly modified by disease pressures given their significant associations with CAD genetic risk. These candidates were all uniquely and consistently associated with many different male and female reproductive traits suggesting selection may have also targeted these because of their direct effects on fitness. We found that CAD loci are significantly enriched for lifetime reproductive success relative to the rest of the human genome, with evidence that the relationship between CAD and lifetime reproductive success is antagonistic. This supports the presence of antagonistic-pleiotropic tradeoffs on CAD loci and provides a novel explanation for the maintenance and high prevalence of CAD in modern humans. Lastly, we found that positive selection more often targeted CAD gene regulatory variants using HapMap3 lymphoblastoid cell lines, which further highlights the unique biological significance of candidate adaptive loci underlying CAD. Our study provides a novel approach for detecting selection on polygenic traits and evidence that modern human genomes have evolved in response to CAD-induced selection pressures and other early-life traits sharing pleiotropic links with CAD.