Variations in DNA copy number carry information on the modalities of genome evolution and mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual.
We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number variant regions. GFL is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets.
The flexibility of our framework makes it applicable to data obtained with a wide range of technologies. Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets.
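As a point of reference for the penalized estimation mentioned above, the standard single-signal fused lasso objective can be written as follows. This is a sketch under assumptions: the abstract does not state the GFL objective, so the symbols here ($y_i$ for the observed intensity at probe $i$, $\beta_i$ for the fitted copy-number level, and the tuning parameters $\lambda_1, \lambda_2$) are illustrative, and the joint penalty that GFL uses to process multiple signals is not reproduced.

```latex
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^n}
  \frac{1}{2} \sum_{i=1}^{n} (y_i - \beta_i)^2
  + \lambda_1 \sum_{i=1}^{n} |\beta_i|
  + \lambda_2 \sum_{i=2}^{n} |\beta_i - \beta_{i-1}|
```

The first penalty shrinks most fitted levels toward zero (no copy number change), while the second penalizes jumps between adjacent probes, so the fitted signal tends to be piecewise constant and its segments can be read off as candidate variant regions.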
Background: Since 2008, multiple studies have reported on copy number variations (CNVs) in schizophrenia. However, many regions are unique events with minimal overlap between studies. This makes it difficult to gain a comprehensive overview of all CNVs involved in the etiology of schizophrenia. We performed a systematic CNV study on the basis of a homogeneous genome-wide dataset aiming at all CNVs ≥50 kilobase pairs. We complemented this analysis with a review of cytogenetic and chromosomal abnormalities for schizophrenia reported in the literature with the purpose of combining classical genetic findings and our current understanding of genomic variation. Methods: We investigated 834 Dutch schizophrenia patients and 672 Dutch control subjects. The CNVs were included if they were detected by QuantiSNP ( http://www.well.ox.ac.uk/QuantiSNP/ ) as well as PennCNV ( http://www.neurogenome.org/cnv/penncnv/ ) and contained known protein coding genes. The integrated identification of CNV regions and cytogenetic loci indicates regions of interest (cytogenetic regions of interest, CROIs). Results: In total, 2437 CNVs were identified with an average number of 2.1 CNVs/subject for both cases and control subjects. We observed significantly more deletions but not duplications in schizophrenia cases versus control subjects. The CNVs identified coincide with loci previously reported in the literature, confirming well-established schizophrenia CROIs 1q42 and 22q11.2 as well as indicating a potentially novel CROI on chromosome 5q35.1. Conclusions: Chromosomal deletions are more prevalent in schizophrenia patients than in healthy subjects and therefore confer a risk factor for pathogenicity. The combination of our CNV data with previously reported cytogenetic abnormalities in schizophrenia provides an overview of potentially interesting regions for positional candidate genes.
Late endosomes and lysosomes of mammalian cells in interphase tend to concentrate in the perinuclear region that harbors the microtubule-organizing center. We have previously reported abnormal distribution of these organelles - as judged by reduced percentages of cells displaying pronounced perinuclear accumulation - in mutant fibroblasts lacking BLOC-3 (for 'biogenesis of lysosome-related organelles complex 3'). BLOC-3 is a protein complex that contains the products of the genes mutated in Hermansky-Pudlak syndrome types 1 and 4. Here, we developed a method based on image analysis to estimate the extent of organelle clustering in the perinuclear region of cultured cells. Using this method, we corroborated that the perinuclear clustering of late endocytic organelles containing Lamp1 (for 'lysosome-associated membrane protein 1') is reduced in BLOC-3-deficient murine fibroblasts, and found that it is apparently normal in fibroblasts deficient in BLOC-1 or BLOC-2, which are another two protein complexes associated with Hermansky-Pudlak syndrome. Wild-type and mutant fibroblasts were transfected to express human LAMP1 fused at its cytoplasmic tail to green fluorescence protein (GFP). At low expression levels, LAMP1-GFP was targeted correctly to late endocytic organelles in both wild-type and mutant cells. High levels of LAMP1-GFP overexpression elicited aberrant aggregation of late endocytic organelles, a phenomenon that probably involved formation of anti-parallel dimers of LAMP1-GFP as it was not observed in cells expressing comparable levels of a non-dimerizing mutant variant, LAMP1-mGFP. To test whether BLOC-3 plays a role in the movement of late endocytic organelles, time-lapse fluorescence microscopy experiments were performed using live cells expressing low levels of LAMP1-GFP or LAMP1-mGFP.
Although active movement of late endocytic organelles was observed in both wild-type and mutant fibroblasts, quantitative analyses revealed a relatively lower frequency of microtubule-dependent movement events, either towards or away from the perinuclear region, within BLOC-3-deficient cells. By contrast, neither the duration nor the speed of these microtubule-dependent events seemed to be affected by the lack of BLOC-3 function. These results suggest that BLOC-3 function is required, directly or indirectly, for optimal attachment of late endocytic organelles to microtubule-dependent motors.
Glioblastoma (GBM) is among the most lethal of all cancers. GBM consists of a heterogeneous population of tumor cells among which a tumor-initiating and treatment-resistant subpopulation, here termed GBM stem cells, has been identified as a primary therapeutic target. Here, we describe a high-throughput small molecule screening approach that enables the identification and characterization of chemical compounds that are effective against GBM stem cells. The paradigm uses a tissue culture model to enrich for GBM stem cells derived from human GBM resections and combines a phenotype-based screen with gene target-specific screens for compound identification. We used 31,624 small molecules from 7 chemical libraries that we characterized and ranked based on their effect on a panel of GBM stem cell-enriched cultures and their effect on the expression of a module of genes whose expression negatively correlates with clinical outcome: MELK, ASPM, TOP2A, and FOXM1b. Of the 11 compounds meeting criteria for exerting differential effects across cell types used, 4 compounds showed selectivity by inhibiting multiple GBM stem cell-enriched cultures compared with nonenriched cultures: emetine, n-arachidonoyl dopamine, n-oleoyldopamine (OLDA), and n-palmitoyl dopamine. ChemBridge compounds #5560509 and #5256360 inhibited the expression of the 4 mitotic module genes. OLDA, emetine, and compounds #5560509 and #5256360 were chosen for more detailed study and inhibited GBM stem cells in self-renewal assays in vitro and in a xenograft model in vivo. These studies show that our screening strategy provides potential candidates and a blueprint for lead compound identification in larger scale screens or screens involving other cancer types.
Multilayer knockoff filter. Katsevich, Eugene; Sabatti, Chiara
The Annals of Applied Statistics, 03/2019, Volume 13, Issue 1
Journal Article. Peer reviewed. Open access.
We tackle the problem of selecting from among a large number of variables those that are “important” for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [Ann. Statist. 43 (2015) 2055–2085] and the multilayer testing framework of Barber and Ramdas [J. Roy. Statist. Soc. Ser. B 79 (2017) 1247–1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.
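To make the knockoff idea concrete, here is a minimal Python sketch of the single-layer selection step from the knockoff construction of Barber and Candès, on which MKF builds. It is not the multilayer filter itself: it assumes a vector `W` of knockoff statistics has already been computed (large positive values favor the real variable over its knockoff), and the function names `knockoff_threshold` and `knockoff_select` are hypothetical, chosen for illustration.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t with estimated FDP <= q (offset=1 gives the 'knockoff+' rule)."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the nonzero magnitudes of the statistics.
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        # Negative statistics estimate how many false positives pass the threshold.
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold achieves the target level

def knockoff_select(W, q=0.1, offset=1):
    """Indices of variables whose statistic clears the data-dependent threshold."""
    t = knockoff_threshold(W, q, offset)
    return np.flatnonzero(np.asarray(W, dtype=float) >= t)

# Illustrative statistics (made up for this example, not from any dataset).
W_example = [5, 4, 3, 2.5, 2, 1.5, -1, 0.5, 3.5, 4.5, 6, 7]
selected = knockoff_select(W_example, q=0.15)
```

The design point is that the signs of the statistics are symmetric under the null, so the count of large negative values serves as a conservative estimate of the false discoveries among the large positive ones. MKF, as described above, applies this kind of control at the variable and group levels simultaneously, which this sketch does not attempt.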
The genome-wide distribution of linkage disequilibrium (LD) determines the strategy for selecting markers for association studies, but it varies between populations. We assayed LD in large samples (200 individuals) from each of 11 well-described population isolates and an outbred European-derived sample, using SNP markers spaced across chromosome 22. Most isolates show substantially higher levels of LD than the outbred sample and many fewer regions of very low LD (termed 'holes'). Young isolates known to have had relatively few founders show particularly extensive LD with very few holes; these populations offer substantial advantages for genome-wide association mapping.
Genome-wide association studies (GWAS) have identified >500 common variants associated with quantitative metabolic traits, but in aggregate such variants explain at most 20-30% of the heritable component of population variation in these traits. To further investigate the impact of genotypic variation on metabolic traits, we conducted re-sequencing studies in >6,000 members of a Finnish population cohort (The Northern Finland Birth Cohort of 1966 [NFBC]) and a type 2 diabetes case-control sample (The Finland-United States Investigation of NIDDM Genetics [FUSION] study). By sequencing the coding sequence and 5' and 3' untranslated regions of 78 genes at 17 GWAS loci associated with one or more of six metabolic traits (serum levels of fasting HDL-C, LDL-C, total cholesterol, triglycerides, plasma glucose, and insulin), and conducting both single-variant and gene-level association tests, we obtained a more complete understanding of phenotype-genotype associations at eight of these loci. At all eight of these loci, the identification of new associations provides significant evidence for multiple genetic signals to one or more phenotypes, and at two loci, in the genes ABCA1 and CETP, we found significant gene-level evidence of association to non-synonymous variants with MAF<1%. Additionally, two potentially deleterious variants that demonstrated significant associations (rs138726309, a missense variant in G6PC2, and rs28933094, a missense variant in LIPC) were considerably more common in these Finnish samples than in European reference populations, supporting our prior hypothesis that deleterious variants could attain high frequencies in this isolated population, likely due to the effects of population bottlenecks. Our results highlight the value of large, well-phenotyped samples for rare-variant association analysis, and the challenge of evaluating the phenotypic impact of such variants.
We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations. This allows the generation of imperfect copies (knockoffs) of these variables that serve as ideal negative controls, correcting for linkage disequilibrium and accounting for unknown population structure, which may be due to diverse ancestries or familial relatedness. The validity and effectiveness of our method are demonstrated by extensive simulations and by applications to the UK Biobank data. These analyses confirm our method is powerful relative to state-of-the-art alternatives, while comparisons with other studies validate most of our discoveries. Finally, fast software is made available for researchers to analyze Biobank-scale datasets.
Scientific hypotheses in a variety of applications have domain-specific structures, such as the tree structure of the international classification of diseases (ICD), the directed acyclic graph structure of the gene ontology (GO), or the spatial structure in genome-wide association studies. In the context of multiple testing, the resulting relationships among hypotheses can create redundancies among rejections that hinder interpretability. This leads to the practice of filtering rejection sets obtained from multiple testing procedures, which may in turn invalidate their inferential guarantees. We propose Focused BH, a simple, flexible, and principled methodology to adjust for the application of any prespecified filter. We prove that Focused BH controls the false discovery rate under various conditions, including when the filter satisfies an intuitive monotonicity property and the p-values are positively dependent. We demonstrate in simulations that Focused BH performs well across a variety of settings, and illustrate this method's practical utility via analyses of real datasets based on ICD and GO.
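For orientation, the sketch below implements the classical Benjamini-Hochberg (BH) step-up procedure in Python, the unfiltered baseline that a method such as Focused BH adjusts. It is not Focused BH: no filter is applied and no correction for filtering is made. The function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return sorted indices of hypotheses rejected by the BH step-up rule at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)  # indices of p-values from smallest to largest
    # Compare the k-th smallest p-value with q * k / m for k = 1..m.
    below = p[order] <= q * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.flatnonzero(below))  # largest rank that meets its cutoff
    return np.sort(order[:k + 1])      # reject everything up to that rank

# Illustrative p-values (made up for this example).
p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_example, q=0.05)
```

If a filter were then applied to `rejected` (for instance, keeping only the most specific nodes in a tree of hypotheses), the FDR guarantee of plain BH would no longer automatically hold for the filtered set, which is the issue the abstract above addresses.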
Causal inference in genetic trio studies. Bates, Stephen; Sesia, Matteo; Sabatti, Chiara ...
Proceedings of the National Academy of Sciences - PNAS, 09/2020, Volume 117, Issue 39
Journal Article. Peer reviewed. Open access.
We introduce a method to draw causal inferences (inferences immune to all possible confounding) from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by developing a conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed digital twin test compares an observed offspring to carefully constructed synthetic offspring from the same parents to determine statistical significance, and it can leverage any black-box multivariate model and additional nontrio genetic data to increase power. Crucially, our inferences are based only on a well-established mathematical model of recombination and make no assumptions about the relationship between the genotypes and phenotypes. We compare our method to the widely used transmission disequilibrium test and demonstrate enhanced power and localization.