Motivation: An important question that has emerged from the recent success of genome-wide association studies (GWAS) is how to detect genetic signals beyond single markers or genes in order to explore their combined effects on mediating complex diseases and traits. Integrative testing of GWAS association data with data from prior-knowledge databases and proteome studies has recently gained attention. These methodologies may hold promise for comprehensively examining the interactions between genes underlying the pathogenesis of complex diseases.
Methods: Here, we present a dense module searching (DMS) method to identify candidate subnetworks or genes for complex diseases by integrating the association signal from GWAS datasets into the human protein-protein interaction (PPI) network. The DMS method extensively searches for subnetworks enriched with low P-value genes in GWAS datasets. Compared with pathway-based approaches, this method introduces flexibility in defining a gene set and can effectively utilize local PPI information.
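As a rough illustration of the search described above, the following Python sketch grows one module greedily from a seed gene. The scoring rule (sum of gene z-scores divided by the square root of module size), the expansion threshold `r`, the one-step neighbourhood, and the toy network and P-values are all assumptions made for illustration. They are not taken from this abstract and should not be read as the dmGWAS implementation.

```python
import math
from statistics import NormalDist


def p_to_z(p):
    """Convert a gene-level P-value to a z-score (smaller P gives larger z)."""
    return NormalDist().inv_cdf(1.0 - p)


def module_score(genes, z):
    """Assumed module score: sum of z-scores scaled by sqrt(module size)."""
    return sum(z[g] for g in genes) / math.sqrt(len(genes))


def grow_module(seed, network, z, r=0.1):
    """Greedily add the neighbouring gene that most improves the score.

    Expansion stops when no neighbour raises the score by more than a
    factor of (1 + r). Assumes the seed gene has a positive z-score.
    """
    module = {seed}
    while True:
        current = module_score(module, z)
        # Genes directly connected to any module member, not yet included.
        neighbours = set().union(*(network[g] for g in module)) - module
        best_gene, best_score = None, current
        for g in sorted(neighbours):
            s = module_score(module | {g}, z)
            if s > best_score:
                best_gene, best_score = g, s
        if best_gene is None or best_score <= current * (1 + r):
            return module
        module.add(best_gene)


# Hypothetical toy PPI network (adjacency sets) and gene-level P-values.
toy_network = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
toy_p = {"A": 0.001, "B": 0.01, "C": 0.6, "D": 0.02}
toy_z = {gene: p_to_z(p) for gene, p in toy_p.items()}

toy_module = grow_module("A", toy_network, toy_z)
print(sorted(toy_module))
```

In this toy case the low P-value genes A, B, and D are pulled into one module, while C is left out because adding it would lower the score. A full search would repeat this from every gene as seed and then assess the significance of the resulting modules.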
Results: We implemented the DMS method in an R package, which can also evaluate and graphically represent the results. We demonstrated DMS on two GWAS datasets for complex diseases, i.e. breast cancer and pancreatic cancer. For each disease, the DMS method successfully identified a set of significant modules and candidate genes, including some well-studied genes not detected in the single-marker analysis of GWA studies. Functional enrichment analysis and comparison with previously published methods showed that the genes identified by DMS carry stronger association signals.
Availability:
dmGWAS package and documents are available at http://bioinfo.mc.vanderbilt.edu/dmGWAS.html.
Contact:
zhongming.zhao@vanderbilt.edu
Supplementary Information:
Supplementary data are available at Bioinformatics online.
Summary
Background: More than 1000 reports have been published in the past two decades on associations between variants in candidate genes and risk of breast cancer. Results have been generally inconsistent. We did a literature search and meta-analyses to provide a synopsis of the current understanding of the genetic architecture of breast-cancer risk.
Methods: A systematic literature search for candidate-gene association studies of breast-cancer risk was done in two stages, using PubMed on or before Feb 28, 2010. A total of 24 500 publications were identified, of which 1059 were deemed eligible for inclusion. Meta-analyses were done for 279 genetic variants in 128 candidate genes or chromosomal loci that had at least three data sources. Variants with significant associations by meta-analysis were assessed using the Venice criteria and scored as having strong, moderate, or weak cumulative evidence for an association with breast-cancer risk.
Findings: 51 variants in 40 genes showed significant associations with breast-cancer risk. Cumulative epidemiological evidence of an association was graded as strong for ten variants in six genes (ATM, CASP8, CHEK2, CTLA4, NBN, and TP53), moderate for four variants in four genes (ATM, CYP19A1, TERT, and XRCC3), and weak for 37 variants. Additionally, in meta-analyses that included a minimum of 10 000 cases and 10 000 controls, convincing evidence of no association with breast-cancer risk was identified for 45 variants in 37 genes.
Interpretation: Whereas most genetic variants assessed in previous candidate-gene studies showed no association with breast-cancer risk in meta-analyses, 14 variants in nine genes had moderate to strong evidence for an association. Further evaluation of these variants is warranted.
Funding: US National Cancer Institute.
The oral microbiome may play an important role in cancer pathogenesis. However, no study has prospectively investigated the association of the oral microbiome with subsequent risk of developing colorectal cancer (CRC). We conducted a nested case–control study including 231 incident CRC cases and 462 controls within the Southern Community Cohort Study, with 75% of the subjects being African-Americans. The controls were individually matched to cases based on age, ethnic group, smoking, season of study enrollment and recruitment method. Oral microbiota were assessed using 16S rRNA gene sequencing in pre-diagnostic mouth rinse samples. Multiple bacterial taxa showed an association with CRC risk at p < 0.05. Oral pathogens Treponema denticola and Prevotella intermedia were associated with an increased risk of CRC, with odds ratios (ORs) and 95% confidence intervals (CIs) of 1.76 (1.19–2.60) and 1.55 (1.08–2.22), respectively, for the individuals carrying these bacteria compared to non-carriers. In the phylum Actinobacteria, Bifidobacteriaceae was more abundant among CRC patients than among controls. In the phylum Bacteroidetes, Prevotella denticola and Prevotella sp. oral taxon 300 were associated with an increased CRC risk, while Prevotella melaninogenica was associated with a decreased risk of CRC. In the phylum Firmicutes, Carnobacteriaceae, Streptococcaceae, Erysipelotrichaceae, Streptococcus, Solobacterium, Streptococcus sp. oral taxon 058 and Solobacterium moorei showed associations with a decreased risk of CRC. Most of these associations were observed among both African- and European-Americans. Most of the associations were not significant after Bonferroni correction for multiple testing, which may be conservative. Our study suggests that the oral microbiome may play a significant role in CRC etiology.
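To make the multiple-testing remark above concrete, here is a minimal Python sketch of a Bonferroni correction. The taxon labels and P-values are hypothetical placeholders, not results from the study, and the function name `bonferroni` is chosen only for this example.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold (alpha / number of tests) and,
    for each test, whether its P-value falls below that threshold."""
    threshold = alpha / len(p_values)
    calls = {name: p < threshold for name, p in p_values.items()}
    return threshold, calls


# Hypothetical P-values for four taxa (illustration only).
example_p = {"taxon_1": 0.004, "taxon_2": 0.02, "taxon_3": 0.0001, "taxon_4": 0.3}

threshold, calls = bonferroni(example_p)
print(threshold)
print(calls)
```

Because the threshold shrinks in proportion to the number of taxa tested, an association that is nominally significant at p < 0.05 (such as `taxon_2` here) can fail the corrected threshold. This is why the correction can be conservative when many correlated taxa are tested.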
What's new?
Oral microbiome composition is of special interest in research on colorectal cancer risk, owing in part to evidence that certain oral microbes cause chronic inflammation. In this prospective investigation of participants in the Southern Community Cohort Study in the United States, multiple oral bacterial taxa were associated with risk of developing colorectal cancer. Among the taxa, two species, Treponema denticola and Prevotella intermedia, previously were described as pathogenic in the oral cavity. The findings suggest that oral microbiome composition influences colorectal cancer risk, warranting further investigation for the possible use of oral microbiota in colorectal cancer detection and prevention.
Accurate calling of SNPs and genotypes from next-generation sequencing data is an essential prerequisite for most human genetics studies. A number of computational steps are required or recommended when translating the raw sequencing data into the final calls. However, whether each step contributes to the performance of variant calling, and how it affects accuracy, remains unclear, making it difficult to select and arrange appropriate steps to derive high-quality variants from different sequencing data. In this study, we made a systematic assessment of the relative contribution of each step to the accuracy of variant calling from Illumina DNA sequencing data.
We found that the read preprocessing step did not improve the accuracy of variant calling, contrary to the general expectation. Although trimming off low-quality tails helped align more reads, it introduced many false positives. The ability of duplicate marking, local realignment, and recalibration to help eliminate false-positive variants depended on the sequencing depth. Rearranging these steps did not affect the results. The relative performance of three popular multi-sample SNP callers, SAMtools, GATK, and GlfMultiples, also varied with the sequencing depth.
Our findings clarify the necessity and effectiveness of computational steps for improving the accuracy of SNP and genotype calls from Illumina sequencing data and can serve as a general guideline for choosing SNP calling strategies for data with different coverage.
Exome sequencing using next-generation sequencing technologies is a cost-efficient approach to selectively sequencing the coding regions of the human genome for detection of disease variants. A substantial number of DNA fragments from the capture process fall outside target regions, and sequence data for positions outside target regions have been mostly ignored after alignment.
We performed whole exome sequencing on 22 subjects using Agilent SureSelect capture reagent and 6 subjects using Illumina TruSeq capture reagent. We also downloaded sequencing data for 6 subjects from the 1000 Genomes Project Pilot 3 study. Using these data, we examined the quality of SNPs detected outside target regions by computing consistency rate with genotypes obtained from SNP chips or the HapMap database, transition-transversion (Ti/Tv) ratio, and percentage of SNPs inside dbSNP. For all three platforms, we obtained high-quality SNPs outside target regions, and some far from target regions. In our Agilent SureSelect data, we obtained 84,049 high-quality SNPs outside target regions compared to 65,231 SNPs inside target regions (a 129% increase). For our Illumina TruSeq data, we obtained 222,171 high-quality SNPs outside target regions compared to 95,818 SNPs inside target regions (a 232% increase). For the data from the 1000 Genomes Project, we obtained 7,139 high-quality SNPs outside target regions compared to 1,548 SNPs inside target regions (a 461% increase).
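Two of the quantities above are simple enough to sketch in Python: the Ti/Tv ratio and the reported percentage increases, which correspond to the outside-target SNP count expressed as a percentage of the inside-target count. The function names and the toy SNP list are illustrative assumptions. The SNP counts are the ones quoted in the text.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Ti/Tv ratio for a list of (ref, alt) pairs; assumes ref != alt
    and at least one transversion in the list."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv


def percent_increase(outside, inside):
    """SNPs gained outside target regions, as a percentage of those inside."""
    return 100.0 * outside / inside


# Toy SNP list (illustration only): four transitions, two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(toy_snps))

# Counts quoted in the text: (outside, inside) for each data set.
reported = {
    "Agilent SureSelect": (84049, 65231),
    "Illumina": (222171, 95818),
    "1000 Genomes": (7139, 1548),
}
for name, (outside, inside) in reported.items():
    print(name, round(percent_increase(outside, inside)))
```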
These results demonstrate that a substantial number of high-quality genotypes outside target regions can be obtained from exome sequencing data. These data should not be ignored in genetic epidemiology studies.
When using Illumina high-throughput short-read data, the genotypes inferred from the positive and negative strands are sometimes significantly different, with one homozygous and the other heterozygous. This phenomenon is known as strand bias. In this study, we used Illumina short-read sequencing data to evaluate the effect of strand bias on genotyping quality, and to explore the possible causes of strand bias.
We collected 22 breast cancer samples from 22 patients and sequenced their exomes using the Illumina GAIIx machine. By comparing the genotypes inferred from these sequencing data with the genotypes inferred from SNP chip data, we found that SNPs with extreme strand bias did not have significantly lower consistency rates than SNPs with low or no strand bias. However, this result may be limited by the small subset of SNPs present in both the exome sequencing and the SNP chip data. We further compared the transition/transversion ratio and the number of novel non-synonymous SNPs between the SNPs with low or no strand bias and those with extreme strand bias, and found that SNPs with low or no strand bias have better overall quality. We also found that strand bias occurs randomly at genomic positions across these samples, with no consistent pattern of strand bias location across samples. By comparing results from two different aligners, BWA and Bowtie, we found very consistent strand bias patterns. Thus strand bias is unlikely to be caused by alignment artifacts. We successfully replicated our results using two additional independent datasets with different capturing methods and Illumina sequencers.
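The definition of strand bias used here (one strand supporting a homozygous call and the other a heterozygous call) can be sketched in Python as follows. The allele-fraction cut-offs of 0.2 and 0.8 and the function names are hypothetical choices for illustration. The study's actual genotype-calling procedure is not described in this abstract.

```python
def call_genotype(ref_reads, alt_reads, het_low=0.2, het_high=0.8):
    """Naive genotype call from read counts on one strand.

    Uses assumed alternate-allele fraction cut-offs; returns None when
    the strand has no coverage.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return None
    alt_fraction = alt_reads / total
    if alt_fraction < het_low:
        return "hom_ref"
    if alt_fraction > het_high:
        return "hom_alt"
    return "het"


def has_strand_bias(fwd_ref, fwd_alt, rev_ref, rev_alt):
    """True when exactly one strand yields a heterozygous call and the
    other a homozygous call."""
    fwd = call_genotype(fwd_ref, fwd_alt)
    rev = call_genotype(rev_ref, rev_alt)
    if fwd is None or rev is None:
        return False  # cannot compare strands without coverage on both
    return (fwd == "het") != (rev == "het")


# Forward strand looks heterozygous, reverse strand homozygous alternate.
print(has_strand_bias(10, 10, 0, 20))
# Both strands look heterozygous.
print(has_strand_bias(10, 9, 11, 10))
```

A hard threshold like this is only a starting point. Per-strand read counts are often small, so a practical filter would also need to account for sampling noise at low depth.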
Extreme strand bias indicates a potential high false-positive rate for SNPs.
Breast cancer mortality is primarily due to metastasis rather than primary tumors, yet relatively little is understood regarding the etiology of metastatic breast cancer. Previously, using a mouse genetics approach, we demonstrated that inherited germline polymorphisms contribute to metastatic disease, and that these single nucleotide polymorphisms (SNPs) could be used to predict outcome in breast cancer patients. In this study, a backcross between a highly metastatic (FVB/NJ) and low metastatic (MOLF/EiJ) mouse strain identified Arntl2, a gene encoding a circadian rhythm transcription factor, as a metastasis susceptibility gene associated with progression, specifically in estrogen receptor-negative breast cancer patients. Integrated whole genome sequence analysis with DNase hypersensitivity sites reveals SNPs in the predicted promoter of Arntl2. Using CRISPR/Cas9-mediated substitution of the MOLF promoter, we demonstrate that the SNPs regulate Arntl2 transcription and affect metastatic burden. Finally, analysis of SNPs associated with ARNTL2 expression in human breast cancer patients revealed reproducible associations of ARNTL2 expression quantitative trait loci (eQTL) SNPs with disease-free survival, consistent with the mouse studies.
The genetic mechanisms underlying the poor prognosis of esophageal squamous cell carcinoma (ESCC) are not well understood. Here, we report somatic mutations found in ESCC from sequencing 10 whole-genome and 57 whole-exome matched tumor-normal sample pairs. Among the identified genes, we characterized mutations in VANGL1 and showed that they accelerated cell growth in vitro. We also found that five other genes, including three coding genes (SHANK2, MYBL2, FADD) and two non-coding genes (miR-4707-5p, PCAT1), were involved in somatic copy-number alterations (SCNAs) or structural variants (SVs). A survival analysis based on the expression profiles of 321 individuals with ESCC indicated that these genes were significantly associated with poorer survival. Subsequently, we performed functional studies, which showed that miR-4707-5p and MYBL2 promoted proliferation and metastasis. Together, our results shed light on somatic mutations and genomic events that contribute to ESCC tumorigenesis and prognosis and might suggest therapeutic targets.
We carried out a genome-wide association study among Chinese women to identify risk variants for breast cancer. After analyzing 607,728 SNPs in 1,505 cases and 1,522 controls, we selected 29 SNPs for a fast-track replication in an independent set of 1,554 cases and 1,576 controls. We further investigated four replicated loci in a third set of samples comprising 3,472 cases and 900 controls. SNP rs2046210 at 6q25.1, located upstream of the gene encoding estrogen receptor α (ESR1), showed strong and consistent association with breast cancer across all three stages. Adjusted odds ratios (95% CIs) were 1.36 (1.24-1.49) and 1.59 (1.40-1.82), respectively, for genotypes A/G and A/A versus G/G (P for trend = 2.0 × 10⁻¹⁵) in the pooled analysis of samples from all three stages. We also found a similar, albeit weaker, association in an independent study comprising 1,591 cases and 1,466 controls of European ancestry (P for trend = 0.01). These results strongly implicate 6q25.1 as a susceptibility locus for breast cancer.
Genome-wide association studies (GWASs) have identified multiple genetic susceptibility loci for breast cancer. However, these loci explain only a small fraction of the heritability. Very few studies have evaluated copy number variation (CNV), another important source of human genetic variation, in relation to breast cancer risk.
We conducted a CNV GWAS in 2623 breast cancer patients and 1946 control subjects using data from Affymetrix SNP Array 6.0 (stage 1). We then replicated the most promising CNV using real-time quantitative polymerase chain reaction (qPCR) in an independent set of 4254 case patients and 4387 control subjects (stage 2). All subjects were recruited from population-based studies conducted among Chinese women in Shanghai.
Of the 268 common CNVs (minor allele frequency ≥ 5%) investigated in stage 1, the strongest association was found for a common deletion in the APOBEC3 genes (P = 1.1 × 10⁻⁴) and was replicated in stage 2 (odds ratio = 1.35, 95% confidence interval [CI] = 1.27 to 1.44; P = 9.6 × 10⁻²²). Analyses of all samples from both stages using qPCR data produced odds ratios of 1.31 (95% CI = 1.21 to 1.42) for a one-copy deletion and 1.76 (95% CI = 1.57 to 1.97) for a two-copy deletion (P = 2.0 × 10⁻²⁴).
We provide convincing evidence for a novel breast cancer locus at the APOBEC3 genes. This CNV is one of the strongest common genetic risk variants identified so far for breast cancer.