In large collections of tumor samples, it has been observed that sets of genes that are commonly involved in the same cancer pathways tend not to occur mutated together in the same patient. Such gene ...sets form mutually exclusive patterns of gene alterations in cancer genomic data. Computational approaches that detect mutually exclusive gene sets, rank and test candidate alteration patterns by rewarding the number of samples the pattern covers and by punishing its impurity, i.e., additional alterations that violate strict mutual exclusivity. However, the extant approaches do not account for possible observation errors. In practice, false negatives and especially false positives can severely bias evaluation and ranking of alteration patterns. To address these limitations, we develop a fully probabilistic, generative model of mutual exclusivity, explicitly taking coverage, impurity, as well as error rates into account, and devise efficient algorithms for parameter estimation and pattern ranking. Based on this model, we derive a statistical test of mutual exclusivity by comparing its likelihood to the null model that assumes independent gene alterations. Using extensive simulations, the new test is shown to be more powerful than a permutation test applied previously. When applied to detect mutual exclusivity patterns in glioblastoma and in pan-cancer data from twelve tumor types, we identify several significant patterns that are biologically relevant, most of which would not be detected by previous approaches. Our statistical modeling framework of mutual exclusivity provides increased flexibility and power to detect cancer pathways from genomic alteration data in the presence of noise. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2-5.
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands-or even millions-of cells ...analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Abstract
Computational models for drug sensitivity prediction have the potential to significantly improve personalized cancer medicine. Drug sensitivity assays, combined with profiling of cancer cell ...lines and drugs become increasingly available for training such models. Multiple methods were proposed for predicting drug sensitivity from cancer cell line features, some in a multi-task fashion. So far, no such model leveraged drug inhibition profiles. Importantly, multi-task models require a tailored approach to model interpretability. In this work, we develop DEERS, a neural network recommender system for kinase inhibitor sensitivity prediction. The model utilizes molecular features of the cancer cell lines and kinase inhibition profiles of the drugs. DEERS incorporates two autoencoders to project cell line and drug features into 10-dimensional hidden representations and a feed-forward neural network to combine them into response prediction. We propose a novel interpretability approach, which in addition to the set of modeled features considers also the genes and processes outside of this set. Our approach outperforms simpler matrix factorization models, achieving R
$$=$$
=
0.82 correlation between true and predicted response for the unseen cell lines. The interpretability analysis identifies 67 biological processes that drive the cell line sensitivity to particular compounds. Detailed case studies are shown for PHA-793887, XMD14-99 and Dabrafenib.
Drug sensitivity prediction constitutes one of the main challenges in personalized medicine. Critically, the sensitivity of cancer cells to treatment depends on an unknown subset of a large number of ...biological features. Here, we compare standard, data-driven feature selection approaches to feature selection driven by prior knowledge of drug targets, target pathways, and gene expression signatures. We asses these methodologies on Genomics of Drug Sensitivity in Cancer (GDSC) dataset, evaluating 2484 unique models. For 23 drugs, better predictive performance is achieved when the features are selected according to prior knowledge of drug targets and pathways. The best correlation of observed and predicted response using the test set is achieved for Linifanib (r = 0.75). Extending the drug-dependent features with gene expression signatures yields the most predictive models for 60 drugs, with the best performing example of Dabrafenib. For many compounds, even a very small subset of drug-related features is highly predictive of drug sensitivity. Small feature sets selected using prior knowledge are more predictive for drugs targeting specific genes and pathways, while models with wider feature sets perform better for drugs affecting general cellular mechanisms. Appropriate feature selection strategies facilitate the development of interpretable models that are indicative for therapy design.
Cancer aggressiveness and its effect on patient survival depends on mutations in the tumor genome. Epistatic interactions between the mutated genes may guide the choice of anticancer therapy and set ...predictive factors of its success. Inhibitors targeting synthetic lethal partners of genes mutated in tumors are already utilized for efficient and specific treatment in the clinic. The space of possible epistatic interactions, however, is overwhelming, and computational methods are needed to limit the experimental effort of validating the interactions for therapy and characterizing their biomarkers. Here, we introduce SurvLRT, a statistical likelihood ratio test for identifying epistatic gene pairs and triplets from cancer patient genomic and survival data. Compared to established approaches, SurvLRT performed favorable in predicting known, experimentally verified synthetic lethal partners of PARP1 from TCGA data. Our approach is the first to test for epistasis between triplets of genes to identify biomarkers of synthetic lethality-based therapy. SurvLRT proved successful in identifying the known gene TP53BP1 as the biomarker of success of the therapy targeting PARP in BRCA1 deficient tumors. Search for other biomarkers for the same interaction revealed a region whose deletion was a more significant biomarker than deletion of TP53BP1. With the ability to detect not only pairwise but twelve different types of triple epistasis, applicability of SurvLRT goes beyond cancer therapy, to the level of characterization of shapes of fitness landscapes.
Metastases are the main reason for cancer-related deaths. Initiation of metastases, where newly seeded tumor cells expand into colonies, presents a tremendous bottleneck to metastasis formation. ...Despite its importance, a quantitative description of metastasis initiation and its clinical implications is lacking. Here, we set theoretical grounds for the metastatic bottleneck with a simple stochastic model. The model assumes that the proliferation-to-death rate ratio for the initiating metastatic cells increases when they are surrounded by more of their kind. For a total of 159,191 patients across 13 cancer types, we found that a single cell has an extremely low median probability of successful seeding of the order of 10-8. With increasing colony size, a sharp transition from very unlikely to very likely successful metastasis initiation occurs. The median metastatic bottleneck, defined as the critical colony size that marks this transition, was between 10 and 21 cells. We derived the probability of metastasis occurrence and patient outcome based on primary tumor size at diagnosis and tumor type. The model predicts that the efficacy of patient treatment depends on the primary tumor size but even more so on the severity of the metastatic bottleneck, which is estimated to largely vary between patients. We find that medical interventions aiming at tightening the bottleneck, such as immunotherapy, can be much more efficient than therapies that decrease overall tumor burden, such as chemotherapy.
Antimicrobial peptides emerge as compounds that can alleviate the global health hazard of antimicrobial resistance, prompting a need for novel computational approaches to peptide generation. Here, we ...propose HydrAMP, a conditional variational autoencoder that learns lower-dimensional, continuous representation of peptides and captures their antimicrobial properties. The model disentangles the learnt representation of a peptide from its antimicrobial conditions and leverages parameter-controlled creativity. HydrAMP is the first model that is directly optimized for diverse tasks, including unconstrained and analogue generation and outperforms other approaches in these tasks. An additional preselection procedure based on ranking of generated peptides and molecular dynamics simulations increases experimental validation rate. Wet-lab experiments on five bacterial strains confirm high activity of nine peptides generated as analogues of clinically relevant prototypes, as well as six analogues of an inactive peptide. HydrAMP enables generation of diverse and potent peptides, making a step towards resolving the antimicrobial resistance crisis.
Copy number alterations constitute important phenomena in tumor evolution. Whole genome single-cell sequencing gives insight into copy number profiles of individual cells, but is highly noisy. Here, ...we propose CONET, a probabilistic model for joint inference of the evolutionary tree on copy number events and copy number calling. CONET employs an efficient, regularized MCMC procedure to search the space of possible model structures and parameters. We introduce a range of model priors and penalties for efficient regularization. CONET reveals copy number evolution in two breast cancer samples, and outperforms other methods in tree reconstruction, breakpoint identification and copy number calling.
Establishing the cancer type and site of origin is important in determining the most appropriate course of treatment for cancer patients. Patients with cancer of unknown primary, where the site of ...origin cannot be established from an examination of the metastatic cancer cells, typically have poor survival. Here, we evaluate the potential and limitations of utilising gene alteration data from tumour DNA to identify cancer types.
Using sequenced tumour DNA downloaded via the cBioPortal for Cancer Genomics, we collected the presence or absence of calls for gene alterations for 6640 tumour samples spanning 28 cancer types, as predictive features. We employed three machine-learning techniques, namely linear support vector machines with recursive feature selection, L
-regularised logistic regression and random forest, to select a small subset of gene alterations that are most informative for cancer-type prediction. We then evaluated the predictive performance of the models in a comparative manner.
We found the linear support vector machine to be the most predictive model of cancer type from gene alterations. Using only 100 somatic point-mutated genes for prediction, we achieved an overall accuracy of 49.4±0.4 % (95 % confidence interval). We observed a marked increase in the accuracy when copy number alterations are included as predictors. With a combination of somatic point mutations and copy number alterations, a mere 50 genes are enough to yield an overall accuracy of 77.7±0.3 %.
A general cancer diagnostic tool that utilises either only somatic point mutations or only copy number alterations is not sufficient for distinguishing a broad range of cancer types. The combination of both gene alteration types can dramatically improve the performance.