Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embeddings might not reliably inform the similarities among cell clusters. Motivated by this challenge, we present a statistical method, scDEED, for detecting dubious cell embeddings output by a 2D-embedding method. By calculating a reliability score for every cell embedding based on the similarity between the cell's 2D-embedding neighbors and pre-embedding neighbors, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. We show the effectiveness of scDEED on multiple datasets for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.
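The idea of comparing a cell's neighbors before and after embedding can be illustrated with a minimal sketch. This is a simplified stand-in and not scDEED's actual reliability score: it assumes Euclidean distance, a fixed neighborhood size `k`, and a Jaccard overlap between the two neighbor sets. The function names and the toy coordinates are hypothetical.

```python
import math

def knn(points, i, k):
    """Return the indices of the k nearest neighbors of points[i] (Euclidean), excluding i."""
    dists = sorted((math.dist(points[i], p), j) for j, p in enumerate(points) if j != i)
    return {j for _, j in dists[:k]}

def neighbor_overlap_scores(pre_embedding, embedding_2d, k):
    """Jaccard overlap between each cell's pre-embedding and 2D-embedding neighbor sets.

    Assumption: this overlap is only an illustrative proxy for a reliability score.
    """
    scores = []
    for i in range(len(pre_embedding)):
        before = knn(pre_embedding, i, k)
        after = knn(embedding_2d, i, k)
        scores.append(len(before & after) / len(before | after))
    return scores

# Toy data: two well-separated groups of three "cells" in 3D.
pre = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (10, 10, 10), (10, 11, 10), (11, 10, 10)]
# A faithful 2D embedding keeps every neighborhood intact.
faithful = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# A distorted embedding places cells 2 and 3 in the wrong group.
distorted = [(0, 0), (0, 1), (10, 10), (1, 0), (10, 11), (11, 10)]

faithful_scores = neighbor_overlap_scores(pre, faithful, k=2)
distorted_scores = neighbor_overlap_scores(pre, distorted, k=2)
```

In this toy example the misplaced cells receive the lowest overlap, which is the kind of signal a reliability score is meant to surface. How scores are turned into "dubious" or "trustworthy" calls is left out here because the abstract does not specify it.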
Abstract
Background
There is an urgent need to identify factors specifically associated with aggressive prostate cancer (PCa) risk. We investigated whether rare pathogenic, likely pathogenic, or deleterious (P/LP/D) germline variants in DNA repair genes are associated with aggressive PCa risk in a case-case study of aggressive vs nonaggressive disease.
Methods
Participants were 5545 European-ancestry men, including 2775 nonaggressive and 2770 aggressive PCa cases, which included 467 metastatic cases (16.9%). Samples were assembled from 12 international studies and germline sequenced together. Rare (minor allele frequency < 0.01) P/LP/D variants were analyzed for 155 DNA repair genes. We compared single variant, gene-based, and DNA repair pathway-based burdens by disease aggressiveness. All statistical tests are 2-sided.
Results
BRCA2 and PALB2 had the most statistically significant gene-based associations, with 2.5% of aggressive and 0.8% of nonaggressive cases carrying P/LP/D BRCA2 alleles (odds ratio [OR] = 3.19, 95% confidence interval [CI] = 1.94 to 5.25, P = 8.58 × 10⁻⁷) and 0.65% of aggressive and 0.11% of nonaggressive cases carrying P/LP/D PALB2 alleles (OR = 6.31, 95% CI = 1.83 to 21.68, P = 4.79 × 10⁻⁴). ATM had a nominal association, with 1.6% of aggressive and 0.8% of nonaggressive cases carrying P/LP/D ATM alleles (OR = 1.88, 95% CI = 1.10 to 3.22, P = .02). In aggregate, P/LP/D alleles within 24 literature-curated candidate PCa DNA repair genes were more common in aggressive than nonaggressive cases (carrier frequencies = 14.2% vs 10.6%, respectively; P = 5.56 × 10⁻⁵). However, this difference was not statistically significant (P = .18) on excluding BRCA2, PALB2, and ATM. Among these 24 genes, P/LP/D carriers had a 1.06-year younger diagnosis age (95% CI = -1.65 to -0.48, P = 3.71 × 10⁻⁴).
Conclusions
Risk conveyed by DNA repair genes is largely driven by rare P/LP/D alleles within BRCA2, PALB2, and ATM. These findings support the importance of these genes in both screening and disease management considerations.
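As a rough illustration of how a carrier-frequency comparison like the BRCA2 result translates into an odds ratio, the sketch below computes a crude (unadjusted) OR with a Wald 95% confidence interval from a 2×2 table. The carrier counts are assumptions back-calculated by rounding the reported percentages against the group sizes given in Methods (2770 aggressive, 2775 nonaggressive); they are not the study's actual counts, and the published estimates come from the authors' own models, so exact agreement is not expected.

```python
import math

def crude_odds_ratio(carriers_a, total_a, carriers_b, total_b, z=1.96):
    """Crude odds ratio (group A vs group B) with a Wald confidence interval."""
    a, b = carriers_a, total_a - carriers_a   # carriers / noncarriers in group A
    c, d = carriers_b, total_b - carriers_b   # carriers / noncarriers in group B
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Assumed counts: about 2.5% of 2770 and about 0.8% of 2775, rounded to whole carriers.
brca2_or, brca2_lower, brca2_upper = crude_odds_ratio(69, 2770, 22, 2775)
print(f"crude OR = {brca2_or:.2f} (95% CI {brca2_lower:.2f} to {brca2_upper:.2f})")
```

Under these assumed counts the crude estimate falls near the reported BRCA2 odds ratio, which is consistent with the carrier frequencies quoted in Results.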
In search of common risk alleles for prostate cancer that could contribute to high rates of the disease in men of African ancestry, we conducted a genome-wide association study, with 1,047,986 SNP markers examined in 3,425 African-Americans with prostate cancer (cases) and 3,290 African-American male controls. We followed up the most significant 17 new associations from stage 1 in 1,844 cases and 3,269 controls of African ancestry. We identified a new risk variant on chromosome 17q21 (rs7210100, odds ratio per allele = 1.51, P = 3.4 × 10⁻¹³). The frequency of the risk allele is ∼5% in men of African descent, whereas it is rare in other populations (<1%). Further studies are needed to investigate the biological contribution of this allele to prostate cancer risk. These findings emphasize the importance of conducting genome-wide association studies in diverse populations.
Investigating genetic architecture of complex traits in ancestrally diverse populations is imperative to understand the etiology of disease. However, the current paucity of genetic research in people of African and Latin American ancestry, Hispanic and indigenous peoples in the United States is likely to exacerbate existing health disparities for many common diseases. The Population Architecture using Genomics and Epidemiology, Phase II (PAGE II) Study was initiated in 2013 by the National Human Genome Research Institute to expand our understanding of complex trait loci in ethnically diverse and well characterized study populations. To meet this goal, the Multi-Ethnic Genotyping Array (MEGA) was designed to substantially improve fine-mapping and functional discovery by increasing variant coverage across multiple ethnicities at known loci for metabolic, cardiovascular, renal, inflammatory, anthropometric, and a variety of lifestyle traits. Studying the frequency distribution of clinically relevant mutations, putative risk alleles, and known functional variants across multiple populations will provide important insight into the genetic architecture of complex diseases and facilitate the discovery of novel, sometimes population-specific, disease associations. DNA samples from 51,650 self-identified African ancestry (17,328), Hispanic/Latino (22,379), Asian/Pacific Islander (8,640), and American Indian (653) and an additional 2,650 participants of either South Asian or European ancestry, and other reference panels have been genotyped on MEGA by PAGE II. MEGA was designed as a new resource for studying ancestrally diverse populations. Here, we describe the methodology for selecting trait-specific content for use in multi-ethnic populations and how enriching MEGA for this content may contribute to deeper biological understanding of the genetic etiology of complex disease.
Few studies have explored the genetic underpinnings of intra-abdominal visceral fat deposition, which varies substantially by sex and race/ethnicity. Among 1,787 participants in the Multiethnic Cohort (MEC)-Adiposity Phenotype Study (MEC-APS), we conducted a genome-wide association study (GWAS) of the percent visceral adipose tissue (VAT) area out of the overall abdominal area, averaged across L1-L5 (%VAT), measured by abdominal magnetic resonance imaging (MRI). A genome-wide significant signal was found on chromosome 2q14.3 in the sex-combined GWAS (lead variant rs79837492: Beta per effect allele = -4.76; P = 2.62 × 10⁻⁸) and in the male-only GWAS (lead variant rs2968545: Beta = -6.50; P = 1.09 × 10⁻⁹), and one suggestive variant was found at 13q12.11 in the female-only GWAS (rs79926925: Beta = 6.95; P = 8.15 × 10⁻⁸). The negatively associated variants were most common in European Americans (T allele of rs79837492; 5%) and African Americans (C allele of rs2968545; 5%) and not observed in Japanese Americans, whereas the positively associated variant was most common in Japanese Americans (C allele of rs79926925, 5%), which was all consistent with the racial/ethnic %VAT differences. In a validation step among UK Biobank participants (N = 23,699 of mainly British and Irish ancestry) with MRI-based VAT volume, both rs79837492 (Beta = -0.026, P = 0.019) and rs2968545 (Beta = -0.028, P = 0.010) were significantly associated in men only (n = 11,524). In the MEC-APS, the association between rs79926925 and plasma sex hormone binding globulin levels reached statistical significance in females, but not in males, with adjustment for total adiposity (Beta = -0.24; P = 0.028), on the log scale. Rs79837492 and rs2968545 are located in intron 5 of CNTNAP5, and rs79926925, in an intergenic region between GJB6 and CRYL1. These novel findings differing by sex and racial/ethnic group warrant replication in additional diverse studies with direct visceral fat measurements.
Little is known regarding the potential relationship between clonal hematopoiesis (CH) of indeterminate potential (CHIP), which is the expansion of hematopoietic stem cells with somatic mutations, and risk of prostate cancer, the fifth leading cause of cancer death of men worldwide. We evaluated the association of age-related CHIP with overall and aggressive prostate cancer risk in two large whole-exome sequencing studies of 75 047 European ancestry men, including 7663 prostate cancer cases, 2770 of which had aggressive disease, and 3266 men carrying CHIP variants. We found that CHIP, defined by over 50 CHIP genes individually and in aggregate, was not significantly associated with overall (aggregate HR = 0.93, 95% CI = 0.76-1.13, P = 0.46) or aggressive (aggregate OR = 1.14, 95% CI = 0.92-1.41, P = 0.22) prostate cancer risk. CHIP was weakly associated with genetic risk of overall prostate cancer, measured using a polygenic risk score (OR = 1.05 per unit increase, 95% CI = 1.01-1.10, P = 0.01). CHIP was not significantly associated with carrying pathogenic/likely pathogenic/deleterious variants in DNA repair genes, which have previously been found to be associated with aggressive prostate cancer. While findings from this study suggest that CHIP is likely not a risk factor for prostate cancer, it will be important to investigate other types of CH in association with prostate cancer risk.
Rare variation in protein coding sequence is poorly captured by GWAS arrays and has been hypothesized to contribute to disease heritability. Using the Illumina HumanExome SNP array, we successfully genotyped 191,032 common and rare non-synonymous, splice site, or nonsense variants in a multiethnic sample of 2,984 breast cancer cases, 4,376 prostate cancer cases, and 7,545 controls. In breast cancer, the strongest associations included either SNPs in or gene burden scores for genes LDLRAD1, SLC19A1, FGFBP3, CASP5, MMAB, SLC16A6, and INS-IGF2. In prostate cancer, one of the most associated SNPs was in the gene GPRC6A (rs2274911, Pro91Ser, OR = 0.88, P = 1.3 × 10⁻⁵) near to a known risk locus for prostate cancer; other suggestive associations were noted in genes such as F13A1, ANXA4, MANSC1, and GP6. For both breast and prostate cancer, several of the most significant associations involving SNPs or gene burden scores (sum of minor alleles) were noted in genes previously reported to be associated with a cancer-related phenotype. However, only one of the associations (rs145889899 in LDLRAD1, P = 2.5 × 10⁻⁷, seen only in African Americans) for overall breast or prostate cancer risk was statistically significant after correcting for multiple comparisons. In addition to breast and prostate cancer, other cancer-related traits were examined (body mass index, PSA level, and alcohol drinking) with a number of known and potentially novel associations described. In general, these findings do not support there being many protein coding variants of moderate to high risk for breast and prostate cancer with odds ratios over a range that is probably required for protein coding variation to play a truly outstanding role in risk heritability. Very large sample sizes will be required to better define the role of rare and less penetrant coding variation in prostate and breast cancer disease genetics.
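The gene burden score described above as a "sum of minor alleles" can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: genotypes are coded as minor-allele counts (0, 1, or 2), missing calls are coded as `None` and skipped, and the function name, gene labels, and toy genotypes are all hypothetical.

```python
def gene_burden_scores(genotypes, variant_genes):
    """Sum minor-allele counts per gene for each individual.

    genotypes:     one list per individual, with a 0/1/2 minor-allele count
                   (or None if missing) for each variant.
    variant_genes: the gene label for each variant, in the same order.
    Returns one {gene: burden} dict per individual.
    """
    all_genes = sorted(set(variant_genes))
    results = []
    for person in genotypes:
        burden = {gene: 0 for gene in all_genes}
        for count, gene in zip(person, variant_genes):
            if count is not None:  # assumption: missing calls contribute nothing
                burden[gene] += count
        results.append(burden)
    return results

# Toy example: three variants in a placeholder gene "GENE_A", two in "GENE_B".
variant_genes = ["GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_B"]
genotypes = [
    [0, 1, 0, 0, 0],      # carries one minor allele in GENE_A
    [1, 1, None, 2, 0],   # one missing call; two alleles in each gene
    [0, 0, 0, 0, 0],      # non-carrier
]
scores = gene_burden_scores(genotypes, variant_genes)
```

Collapsing rare variants into one score per gene is what lets a test gain power when individual variants are too rare to analyze one at a time.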
Several studies have found associations between higher pancreatic fat content and adverse health outcomes, such as diabetes and the metabolic syndrome, but investigations into the genetic contributions to pancreatic fat are limited. This genome-wide association study, comprising 804 participants with MRI-assessed pancreatic fat measurements, was conducted in the ethnically diverse Multiethnic Cohort-Adiposity Phenotype Study (MEC-APS). Two genetic variants reaching genome-wide significance, rs73449607 on chromosome 13q21.2 (Beta = -0.67, P = 4.50 × 10⁻⁸) and rs79967607 on chromosome 6q14 (Beta = -0.90, P = 4.91 × 10⁻⁸), were associated with percent pancreatic fat on the log scale. Rs73449607 was most common in the African American population (13%) and rs79967607 was most common in the European American population (6%). Rs73449607 was also associated with lower risk of type 2 diabetes (OR = 0.95, 95% CI = 0.89-1.00, P = 0.047) in the Population Architecture using Genomics and Epidemiology (PAGE) Study and the DIAbetes Genetics Replication and Meta-analysis (DIAGRAM), which included substantial numbers of non-European ancestry participants (53,102 cases and 193,679 controls). Rs73449607 is located in an intergenic region between GSX1 and PLUTO, and rs79967607 is in intron 1 of EPM2A. PLUTO, a lncRNA, regulates transcription of an adjacent gene, PDX1, that controls beta-cell function in the mature pancreas, and EPM2A encodes the protein laforin, which plays a critical role in regulating glycogen production. If validated, these variants may suggest a genetic component for pancreatic fat and a common etiologic link between pancreatic fat and type 2 diabetes.
Germline gene panel testing is recommended for men with advanced prostate cancer (PCa) or a family history of cancer. While evidence is limited for some genes currently included in panel testing, gene panels are also likely to be incomplete and missing genes that influence PCa risk and aggressive disease.
To identify genes associated with aggressive PCa.
A 2-stage exome sequencing case-only genetic association study was conducted including men of European ancestry from 18 international studies. Data analysis was performed from January 2021 to March 2023. Participants were 9185 men with aggressive PCa (including 6033 who died of PCa and 2397 with confirmed metastasis) and 8361 men with nonaggressive PCa.
Sequencing data were evaluated exome-wide and in a focused investigation of 29 DNA repair pathway and cancer susceptibility genes, many of which are included on gene panels.
The primary study outcomes were aggressive (category T4 or both T3 and Gleason score ≥8 tumors, metastatic PCa, or PCa death) vs nonaggressive PCa (category T1 or T2 and Gleason score ≤6 tumors without known recurrence), and metastatic vs nonaggressive PCa.
A total of 17 546 men of European ancestry were included in the analyses; mean (SD) age at diagnosis was 65.1 (9.2) years in patients with aggressive PCa and 63.7 (8.0) years in those with nonaggressive disease. The strongest evidence of association with aggressive or metastatic PCa was noted for rare deleterious variants in known PCa risk genes BRCA2 and ATM (P ≤ 1.9 × 10⁻⁶), followed by NBN (P = 1.7 × 10⁻⁴). This study found nominal evidence (P < .05) of association with rare deleterious variants in MSH2, XRCC2, and MRE11A. Five other genes had evidence of greater risk (OR ≥ 2) but carrier frequency differences between aggressive and nonaggressive PCa were not statistically significant: TP53, RAD51D, BARD1, GEN1, and SLX4. Deleterious variants in these 11 candidate genes were carried by 2.3% of patients with nonaggressive, 5.6% with aggressive, and 7.0% with metastatic PCa.
The findings of this study provide further support for DNA repair and cancer susceptibility genes to better inform disease management in men with PCa and for extending testing to men with nonaggressive disease, as men carrying deleterious alleles in these genes are likely to develop more advanced disease.
Over the last two decades, the human reference genome has undergone multiple updates as we complete a linear representation of our genome. Two versions of human references are currently used in the biomedical literature, GRCh37/hg19 and GRCh38. Conversions between these versions are critical for quality control, imputation, and association analysis. In the present study, we show that single-nucleotide variants (SNVs) in regions inverted between different builds of the reference genome are often mishandled bioinformatically. Depending on the array type, SNVs are found in approximately 2–5 Mb of the genome that are inverted between reference builds. Coordinate conversions of these variants are mishandled by both the TOPMed imputation server and routine in-house quality control pipelines, leading to underrecognized downstream analytical consequences. Specifically, we observe that undetected allelic conversion errors for palindromic (i.e., A/T or C/G) variants in these inverted regions would destabilize the local haplotype structure, leading to loss of imputation accuracy and power in association analyses. Though only a small proportion of the genome is affected, these regions include important disease susceptibility variants that would be affected. For example, the p value of a known locus associated with prostate cancer on chromosome 10 (chr10) would drop from 2.86 × 10⁻⁷ to 0.0011 in a case-control analysis of 20,286 Africans and African Americans (10,643 cases and 9,643 controls). We devise a straightforward heuristic based on the popular tool, liftOver, that can easily detect and correct these variants in the inverted regions between genome builds to locally improve imputation accuracy.
Genotype imputation infers genetic variants that are not experimentally observed. We identified a common informatic error that leads to poor imputation in regions that are inverted across different versions of the reference genome. We provide a heuristic to identify affected variants so that they can be imputed optimally.
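Why palindromic variants are the ones that slip through can be shown with a short sketch. It is not the authors' liftOver-based heuristic; it only illustrates the underlying allele logic. The `strand_changed` flag is a hypothetical input standing in for whatever signal indicates that a variant sits in a region inverted between builds, and all function names are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(allele1, allele2):
    """True for A/T or C/G SNVs, whose two alleles are complements of each other."""
    return COMPLEMENT.get(allele1) == allele2

def flip_strand(allele1, allele2):
    """Express both alleles on the opposite strand."""
    return COMPLEMENT[allele1], COMPLEMENT[allele2]

def convert_alleles(allele1, allele2, strand_changed):
    """Flip alleles only when the variant's strand changed between builds.

    strand_changed is a hypothetical flag; deriving it is outside this sketch.
    """
    return flip_strand(allele1, allele2) if strand_changed else (allele1, allele2)

def flip_is_detectable(allele1, allele2):
    """A missed flip is visible only if the flipped allele set differs from the original."""
    return set(flip_strand(allele1, allele2)) != {allele1, allele2}

# A/G becomes T/C on the other strand, so a missed flip shows up as an allele mismatch.
# A/T becomes T/A: the same pair of letters, so a missed flip silently swaps the alleles.
for a1, a2 in [("A", "G"), ("A", "T"), ("C", "G")]:
    print(a1, a2, "palindromic:", is_palindromic(a1, a2),
          "flip detectable:", flip_is_detectable(a1, a2))
```

Because allele comparison alone cannot catch the A/T and C/G cases, some positional signal about inverted regions is needed, which is the gap a coordinate-conversion heuristic would have to fill.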