Significance testing was developed as an objective method for summarizing statistical evidence for a hypothesis. It has been widely adopted in genetic studies, including genome-wide association studies and, more recently, exome sequencing studies. However, significance testing in both genome-wide and exome-wide studies must adopt stringent significance thresholds to allow for multiple testing, and it is useful only when studies have adequate statistical power, which depends on the characteristics of the phenotype and the putative genetic variant, as well as the study design. Here, we review the principles and applications of significance testing and power calculation, including recently proposed gene-based tests for rare variants.
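The link between threshold stringency and power can be illustrated with a small sketch. It assumes a two-sided test whose statistic is approximately normal with unit variance, and it treats the expected z-score under the alternative as a given input. The function name and the example values are illustrative assumptions, not taken from the review.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def power_two_sided(expected_z, alpha):
    """Approximate power of a two-sided z-test.

    expected_z: assumed mean of the test statistic under the alternative
                (hypothetical input; in practice it depends on sample size,
                effect size, and allele frequency).
    alpha:      significance threshold applied to each test.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - expected_z)  # reject in upper tail
    lower = _STD_NORMAL.cdf(-z_crit - expected_z)       # reject in lower tail
    return upper + lower


if __name__ == "__main__":
    # Same assumed signal strength, nominal versus genome-wide threshold.
    for alpha in (0.05, 5e-8):
        print(alpha, power_two_sided(4.0, alpha))
```

Under these assumptions, a signal that is almost always detected at a nominal 0.05 threshold can have well under 50% power at a genome-wide threshold, which is why power calculations are tied to the threshold in use.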
The gene has been proposed as an attractive unit of analysis for association studies, but a simple yet valid, powerful, and sufficiently fast method of evaluating the statistical significance of all genes in large, genome-wide datasets has been lacking. Here we propose the use of an extended Simes test that integrates functional information and association evidence to combine the p values of the single nucleotide polymorphisms within a gene to obtain an overall p value for the association of the entire gene. Our computer simulations demonstrate that this test is more powerful than the SNP-based test, offers effective control of the type 1 error rate regardless of gene size and linkage-disequilibrium pattern among markers, and does not need permutation or simulation to evaluate empirical significance. Its statistical power in simulated data is at least comparable, and often superior, to that of several alternative gene-based tests. When applied to real genome-wide association study (GWAS) datasets on Crohn disease, the test detected more significant genes than SNP-based tests and alternative gene-based tests. The proposed test, implemented in an open-source package, has the potential to identify additional novel disease-susceptibility genes for complex diseases from large GWAS datasets.
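For orientation, the standard (unextended) Simes procedure that the proposed test builds on can be sketched as follows. This is not the extended test from the abstract: it ignores linkage disequilibrium and functional weighting, and the function name is an illustrative assumption.

```python
def simes_p(pvalues):
    """Standard Simes combined p value: min over j of m * p_(j) / j,
    where p_(1) <= ... <= p_(m) are the sorted SNP-level p values."""
    if not pvalues:
        raise ValueError("at least one p value is required")
    p_sorted = sorted(pvalues)
    m = len(p_sorted)
    return min(m * p / rank for rank, p in enumerate(p_sorted, start=1))


if __name__ == "__main__":
    # Hypothetical SNP-level p values within one gene.
    print(simes_p([0.01, 0.04, 0.03]))
```

Because the last term of the minimum is the largest p value itself, the combined value never exceeds 1, and no permutation is needed to obtain it.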
Current genome-wide association studies (GWAS) use commercial genotyping microarrays that can assay over a million single nucleotide polymorphisms (SNPs). The number of SNPs is further boosted by advanced statistical genotype-imputation algorithms and large SNP databases for reference human populations. The testing of a huge number of SNPs needs to be taken into account in the interpretation of statistical significance in such genome-wide studies, but this is complicated by the non-independence of SNPs because of linkage disequilibrium (LD). Several previous groups have proposed the use of the effective number of independent markers (M_e) for the adjustment of multiple testing, but current methods of calculation for M_e are limited in accuracy or computational speed. Here, we report a more robust and fast method to calculate M_e. Applying this efficient method implemented in a free software tool named Genetic type 1 error calculator (GEC), we systematically examined the M_e, and the corresponding p-value thresholds required to control the genome-wide type 1 error rate at 0.05, for 13 Illumina or Affymetrix genotyping arrays, as well as for HapMap Project and 1000 Genomes Project datasets which are widely used in genotype imputation as reference panels. Our results suggested the use of a p-value threshold of ~10^-7 as the criterion for genome-wide significance for early commercial genotyping arrays, but slightly more stringent p-value thresholds ~5 × 10^-8 for current or merged commercial genotyping arrays, ~10^-8 for all common SNPs in the 1000 Genomes Project dataset and ~5 × 10^-8 for the common SNPs only within genes.
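A minimal sketch of how an effective number of independent tests maps to a p-value threshold, assuming a Bonferroni or Šidák correction. The abstract does not state which correction GEC applies, and the M_e values below are hypothetical round numbers, not results from the paper.

```python
import math


def bonferroni_threshold(alpha, m_e):
    """Per-test threshold alpha / M_e for a family-wise error rate alpha."""
    return alpha / m_e


def sidak_threshold(alpha, m_e):
    """Per-test threshold 1 - (1 - alpha)**(1 / M_e), computed stably."""
    return -math.expm1(math.log1p(-alpha) / m_e)


if __name__ == "__main__":
    # Hypothetical effective numbers of independent markers.
    for m_e in (500_000, 1_000_000, 5_000_000):
        print(m_e, bonferroni_threshold(0.05, m_e), sidak_threshold(0.05, m_e))
```

Under the Bonferroni assumption, thresholds of about 10^-7 and 5 × 10^-8 correspond to M_e of roughly 500,000 and 1,000,000 respectively; the Šidák value is slightly less stringent for the same M_e.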
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
Molecular genetic investigations of attention deficit hyperactivity disorder (ADHD) have found associations with a variable number tandem repeat (VNTR) situated in the 3'-untranslated region of the dopamine transporter gene (DAT1), a VNTR in exon 3 of the dopamine receptor 4 gene (DRD4) and a microsatellite polymorphism located 18.5 kb from the 5' end of the dopamine receptor 5 gene (DRD5). A number of independent studies have attempted to replicate these findings but the results have been mixed, possibly reflecting inadequate statistical power and the use of different populations and methodologies. In an attempt to clarify this inconsistency, we have combined all the published studies of European and Asian populations up to October 2005 in a meta-analysis to give a comprehensive picture of the role of the three dopamine-related genes using multiple research methods and models. The DRD4 7-repeat (OR = 1.34, 95% CI 1.23-1.45, P = 2 × 10^-12) and 5-repeat (OR = 1.68, 95% CI 1.17-2.41, P = 0.005) alleles as well as the DRD5 148-bp allele (OR = 1.34, 95% CI 1.21-1.49, P = 8 × 10^-8) confer increased risk of ADHD, whereas the DRD4 4-repeat (OR = 0.90, 95% CI 0.84-0.97, P = 0.004) and DRD5 136-bp (OR = 0.57, 95% CI 0.34-0.96, P = 0.022) alleles have protective effects. In contrast, we found no compelling evidence for association with the 480-bp allele of DAT1 (OR = 1.04, 95% CI 0.98-1.11, P = 0.20). No significant publication bias was detected in current studies. In conclusion, there is a statistically significant association between ADHD and dopamine system genes, especially DRD4 and DRD5. These findings strongly implicate the involvement of brain dopamine systems in the pathogenesis of ADHD.
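As a consistency check on figures like these, an approximate z-score and two-sided p value can be recovered from a reported odds ratio and its 95% confidence interval. The sketch assumes a Wald-type interval that is symmetric on the log-odds scale. The meta-analysis may have used a different model, and the inputs are rounded, so the result should match only in order of magnitude.

```python
import math


def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate z and two-sided p from an OR and its 95% CI.

    Assumes log(OR) +/- z_crit * SE produced the interval (Wald interval).
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = log_or / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p


if __name__ == "__main__":
    # Values quoted in the abstract above.
    print(p_from_or_ci(1.34, 1.23, 1.45))  # DRD4 7-repeat allele
    print(p_from_or_ci(1.04, 0.98, 1.11))  # DAT1 480-bp allele
```

For the DRD4 7-repeat allele this gives a p value on the order of 10^-12, in line with the reported figure; for the DAT1 480-bp allele it gives a clearly non-significant value.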
The exome sequencing strategy is promising for finding novel mutations of human monogenic disorders. However, pinpointing the causal mutation in a small number of samples is still a big challenge. Here, we propose a three-level filtration and prioritization framework to identify the causal mutation(s) in exome sequencing studies. This efficient and comprehensive framework successfully narrowed down whole exome variants to very small numbers of candidate variants in the proof-of-concept examples. The proposed framework, implemented in a user-friendly software package, named KGGSeq (http://statgenpro.psychiatry.hku.hk/kggseq), will play a very useful role in exome sequencing-based discovery of human Mendelian disease genes.
Historically, association tests were limited to single variants, so that the allele was considered the basic unit for association testing. As marker density increases and indirect approaches are used to assess association through linkage disequilibrium, association is now frequently considered at the haplotypic level. We suggest that there are difficulties in replicating association findings at the single-nucleotide-polymorphism (SNP) or the haplotype level, and we propose a shift toward a gene-based approach in which all common variation within a candidate gene is considered jointly. Inconsistencies arising from population differences are more readily resolved by use of a gene-based approach rather than either a SNP-based or a haplotype-based approach. A gene-based approach captures all of the potential risk-conferring variations; thus, negative findings are subject only to the issue of power. In addition, chance findings due to multiple testing can be readily accounted for by use of a genewide-significance level. Meta-analysis procedures can be formalized for gene-based methods through the combination of P values. It is only a matter of time before all variation within genes is mapped, at which point the gene-based approach will become the natural end point for association analysis and will inform our search for functional variants relevant to disease etiology.
IMPORTANCE: Modeling genetic nurture (ie, the effects of parental genotypes through influences on the environment experienced by their children) is essential to accurately disentangle genetic and environmental influences on phenotypic variance. However, these influences are often ignored in both epidemiologic and genetic studies of depression. OBJECTIVE: To estimate the association of genetic nurture with depression and neuroticism. DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study jointly modeled parental and offspring polygenic scores (PGSs) across 9 traits to test for the association of genetic nurture with lifetime broad depression and neuroticism using data from nuclear families in the UK Biobank, with data collected between 2006 and 2019. A broad depression phenotype was measured in 38 702 offspring from 20 905 independent nuclear families, with most of these participants also reporting neuroticism scores. Parental genotypes were imputed from sibships or parent-offspring duos and used to calculate parental PGSs. Data were analyzed between March 2021 and January 2023. MAIN OUTCOMES AND MEASURES: Estimates of genetic nurture and direct genetic regression coefficients on broad depression and neuroticism. RESULTS: This study of 38 702 offspring with data on broad depression (mean [SD] age, 55.5 [8.2] years at study entry; 58% female) found limited preliminary evidence for a statistically significant association of genetic nurture with lifetime depression and neuroticism in adults. The estimated regression coefficient of the parental depression PGS on offspring neuroticism (β = 0.04, SE = 0.02, P = 6.63 × 10^-3) was estimated to be approximately two-thirds (66%) that of the offspring’s depression PGS (β = 0.06, SE = 0.01, P = 6.13 × 10^-11). Evidence for an association between parental cannabis use disorder PGS and offspring depression was also found (β = 0.08, SE = 0.03, P = .02), which was estimated to be 2 times greater than the association between the offspring’s cannabis use disorder PGS and their own depression status (β = 0.04, SE = 0.02, P = .07). CONCLUSIONS AND RELEVANCE: The results of this cross-sectional study highlight the potential for genetic nurture to bias results from epidemiologic and genetic studies on depression or neuroticism and, with further replication and larger samples, identify potential avenues for future prevention and intervention efforts.
Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) non-synonymous single nucleotide variants (nsSNVs). Minor allele frequency (MAF) filtering and functional prediction methods are commonly used to identify candidate pathogenic mutations in these studies. Combining multiple functional prediction methods may increase accuracy in prediction. Here, we propose to use a logit model to combine multiple prediction methods and compute an unbiased probability of a rare variant being pathogenic. Also, for the first time we assess the predictive power of seven prediction methods (including SIFT, PolyPhen2, CONDEL, and logit) in distinguishing pathogenic nsSNVs from other rare variants, which reflects the situation after MAF filtering is done in exome-sequencing studies. We found that a logit model combining all or some original prediction methods outperforms other methods examined, but is unable to discriminate between autosomal dominant and autosomal recessive disease mutations. Finally, based on the predictions of the logit model, we estimate that around 5% of an individual's rare nsSNVs are pathogenic and that an individual carries at least ~22 pathogenic derived alleles, which if made homozygous by consanguineous marriages may lead to recessive diseases.
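The general shape of a logit combination of prediction scores can be sketched as below. The weights, intercept, and input scores are hypothetical placeholders. In the study the coefficients would be fitted to known pathogenic and non-pathogenic variants, and the fitted values are not given in the abstract.

```python
import math


def combined_pathogenic_probability(scores, weights, intercept):
    """Logistic combination of per-method prediction scores.

    scores:    one score per prediction method for a single variant.
    weights:   one coefficient per method (hypothetical here).
    intercept: model intercept (hypothetical here).
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    linear = intercept + sum(w * s for w, s in zip(weights, scores))
    # Numerically stable logistic function for large |linear|.
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    e = math.exp(linear)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Two hypothetical variants scored by three hypothetical methods.
    weights, intercept = [2.0, 1.5, 1.0], -3.0
    print(combined_pathogenic_probability([0.9, 0.8, 0.95], weights, intercept))
    print(combined_pathogenic_probability([0.1, 0.2, 0.05], weights, intercept))
```

With positive weights, a variant that scores higher on every method receives a higher combined probability, which gives a single ranking across methods.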
Indirect parental genetic effects may be defined as the influence of parental genotypes on offspring phenotypes over and above that which results from the transmission of genes from parents to their children. However, given the relative paucity of large-scale family-based cohorts around the world, it is difficult to demonstrate parental genetic effects on human traits, particularly at individual loci. In this manuscript, we illustrate how parental genetic effects on offspring phenotypes, including late onset conditions, can be estimated at individual loci in principle using large-scale genome-wide association study (GWAS) data, even in the absence of parental genotypes. Our strategy involves creating "virtual" mothers and fathers by estimating the genotypic dosages of parental genotypes using physically genotyped data from relative pairs. We then utilize the expected dosages of the parents, and the actual genotypes of the offspring relative pairs, to perform conditional genetic association analyses to obtain asymptotically unbiased estimates of maternal, paternal and offspring genetic effects. We apply our approach to 19066 sibling pairs from the UK Biobank and show that a polygenic score consisting of imputed parental educational attainment SNP dosages is strongly related to offspring educational attainment even after correcting for offspring genotype at the same loci. We develop a freely available web application that quantifies the power of our approach using closed form asymptotic solutions. We implement our methods in a user-friendly software package IMPISH (IMputing Parental genotypes In Siblings and Half Siblings) which allows users to quickly and efficiently impute parental genotypes across the genome in large genome-wide datasets, and then use these estimated dosages in downstream linear mixed model association analyses. We conclude that imputing parental genotypes from relative pairs may provide a useful adjunct to existing large-scale genetic studies of parents and their offspring.