Abstract
Motivation
Combination therapies have been widely used to treat cancers. However, experimentally screening synergistic drug pairs is costly and time-consuming because of the enormous number of possible drug combinations. Thus, computational methods have become an important way to predict and prioritize synergistic drug pairs.
Results
We proposed a Deep Tensor Factorization (DTF) model, which integrates a tensor factorization method and a deep neural network (DNN), to predict drug synergy. The former extracts latent features from drug synergy information, while the latter constructs a binary classifier to predict the drug synergy status. Compared to the tensor-based method, the DTF model performed better in predicting drug synergy: the area under the precision-recall curve (PR AUC) was 0.58 for DTF and 0.24 for the tensor method. We also compared the DTF model with DeepSynergy and logistic regression models, and found that DTF outperformed the logistic regression model and achieved performance similar to that of DeepSynergy on several classification metrics. Applying the DTF model to predict missing entries in our drug–cell-line tensor, we identified novel synergistic drug combinations for 10 cell lines from 5 cancer types. A literature survey showed that some of these predicted drug synergies have been identified in vivo or in vitro. Thus, the DTF model could be a valuable in silico tool for prioritizing novel synergistic drug combinations.
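To make the two-stage idea concrete, the sketch below shows how latent factors from a factorized drug × drug × cell-line tensor could be turned into classifier inputs. It is a minimal illustration under stated assumptions, not the published implementation: the CP-style (rank-wise product) reconstruction, the shared drug factor matrix, the concatenated feature layout, and all names (`drug_factors`, `cell_factors`, `cp_entry`, `triple_features`) are hypothetical, and the random factors stand in for factors that would be learned from observed synergy scores.

```python
import random


def cp_entry(drug_factors, cell_factors, i, j, k):
    """CP-style estimate of the tensor entry for (drug i, drug j, cell line k)."""
    return sum(
        a * b * c
        for a, b, c in zip(drug_factors[i], drug_factors[j], cell_factors[k])
    )


def triple_features(drug_factors, cell_factors, i, j, k):
    """Concatenate the three latent vectors into one input row (assumed layout)."""
    return drug_factors[i] + drug_factors[j] + cell_factors[k]


# Placeholder factors: 4 drugs and 2 cell lines, each with a rank-3 latent vector.
rank = 3
rng = random.Random(0)
drug_factors = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(4)]
cell_factors = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(2)]

# Features that a downstream binary classifier (for example a DNN) would consume.
features = triple_features(drug_factors, cell_factors, 0, 2, 1)
tensor_only_score = cp_entry(drug_factors, cell_factors, 0, 2, 1)
```

In this layout the tensor-only prediction is a fixed multilinear function of the factors, whereas a classifier trained on `features` is free to learn a nonlinear decision rule from the same latent information.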
Availability and implementation
Source code and data are available at https://github.com/ZexuanSun/DTF-Drug-Synergy.
Supplementary information
Supplementary data are available at Bioinformatics online.
Young women with breast cancer have disproportionately poor clinical outcomes compared to their older counterparts. The underlying biological differences behind this age-dependent disparity are still unknown and warrant investigation. Recently, the tumor immune landscape has received much attention for its prognostic value and therapeutic targets. The differential tumor immune landscape between age groups in breast cancer has not yet been characterized, and may contribute to the age-related differences in clinical outcomes. Computational deconvolution was used to quantify the abundance of immune cell types from bulk transcriptome profiles of breast cancer patients from two independent datasets. No significant differences in immune cell composition that were consistent across the two cohorts were found between the young and old age groups. Despite the absence of significant differences, higher tumor infiltration of several immune cell types, such as CD8+ T and CD4+ T cells, was associated with better clinical outcomes in the young but not in the old age group. Mutational signature analysis showed signatures previously not found in breast cancer to be associated with tumor-infiltrating lymphocyte (TIL) levels in the young age group, whereas in the old group, all significant signatures were those previously found in breast cancer. Pathway analysis revealed different gene sets associated with TIL levels for each age group from the two cohorts. Overall, our results show trends towards better clinical outcomes for high TIL levels, especially CD8+ T cells, but only in the young age group. Furthermore, our work suggests that the underlying biological differences may involve multiple levels of tumor physiology.
Spatial transcriptomics has gained popularity over the past decade due to its ability to evaluate transcriptome data while preserving spatial information. Cell segmentation is a crucial step in spatial transcriptomic analysis, as it enables the avoidance of unpredictable tissue disentanglement steps. Although high-quality cell segmentation algorithms can aid in the extraction of valuable data, traditional methods are frequently non-spatial, do not account for spatial information efficiently, and perform poorly when segmenting cells of varying shapes in spatial transcriptome data. In this study, we propose ST-CellSeg, an image-based machine learning method for spatial transcriptomics that uses a manifold for cell segmentation and is novel in its consideration of multi-scale information. We first construct a fully connected graph, which acts as a spatial transcriptomic manifold. Using multi-scale data, we then determine the low-dimensional spatial probability distribution representation for cell segmentation. Using the adjusted Rand index (ARI), normalized mutual information (NMI), and Silhouette coefficient (SC) as model performance measures, the proposed algorithm significantly outperforms baseline models on selected datasets and is computationally efficient.
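As an illustration of one of the evaluation measures named above, the following sketch computes the adjusted Rand index between a reference labeling and a predicted segmentation labeling. It uses the standard pair-counting formula; the function name and the toy labels are illustrative and are not taken from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts of the contingency table."""
    n = len(labels_true)
    # Pairs of items that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of items that share a cluster within each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to how clusters are named, so relabeled but identical partitions score 1, while partitions that agree no better than chance score near or below 0.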
The composition of cell types is a key indicator of health. Advancements in bulk gene expression data curation, single-cell RNA-sequencing technologies, and computational deconvolution approaches offer a new perspective to learn about the composition of different cell types in a quick and affordable way. In this study, we developed a quantile regression and deep learning-based method called Neural Network Immune Contexture Estimator (NNICE) to estimate cell type abundance and its uncertainty by automatically deconvolving bulk RNA-seq data. The proposed NNICE model was able to successfully recover ground-truth cell type fraction values given unseen bulk mixture gene expression profiles from the same dataset it was trained on. Compared with baseline methods, NNICE achieved better performance in deconvolving both pseudo-bulk gene expression data (Pearson correlation R = 0.9) and real bulk gene expression data (Pearson correlation R = 0.9) across all cell types. In conclusion, NNICE combines statistical inference with deep learning to provide accurate and interpretable cell type deconvolution from bulk gene expression.
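Quantile regression is typically trained with the pinball (quantile) loss, which is what allows a model to output lower and upper quantiles as an uncertainty band around a predicted fraction. The sketch below shows that loss in isolation; whether NNICE uses exactly this form is an assumption, and the function name and toy data are illustrative only.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(q * diff, (q - 1) * diff)
    return total / len(y_true)


# Among constant predictions, the loss is smallest near the empirical q-quantile.
observations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
best_constant = min(
    observations,
    key=lambda c: pinball_loss(observations, [c] * len(observations), 0.75),
)
```

Training one output per quantile level (for example 0.05, 0.5 and 0.95) with this loss yields a point estimate together with an interval, without assuming a particular error distribution.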
COVID-19 is a newly identified disease that is highly contagious and has been spreading rapidly across countries around the world, calling for rapid and accurate diagnostic tools. Chest CT imaging has been widely used in clinical practice for disease diagnosis, but image reading is still time-consuming. We aim to integrate an image preprocessing technology for anomaly detection with supervised deep learning for chest CT imaging-based COVID-19 diagnosis. In this study, a matrix profile technique was introduced to CT image anomaly detection at two levels. At the one-dimensional level, CT images were simply flattened into a one-dimensional vector so that the matrix profile algorithm could be applied to them directly. At the two-dimensional level, a matrix profile was calculated in a sliding-window manner for every segment in the image. An anomaly severity score (CT-SS) was calculated, and the difference in CT-SS between the COVID-19 and non-COVID-19 CT images was tested. A sparse anomaly mask was calculated and applied to penalize the pixel values of each image. The anomaly-weighted images were then used to train standard DenseNet deep learning models to distinguish COVID-19 from non-COVID-19 CT images. A VGG19 model was used as a baseline model for comparison. Although extra fine-tuning needs to be done manually, the one-dimensional matrix profile method could identify the anomalies successfully. Using the two-dimensional matrix profiling method, a CT-SS and an anomaly-weighted image can be successfully generated for each image. The CT-SS differed significantly between the COVID-19 and non-COVID-19 CT images (p-value < 0.05). Furthermore, we identified a potential causal association between the number of underlying diseases of a COVID-19 patient and the severity of the disease through statistical mediation analysis.
Compared to the raw images, the anomaly-weighted images showed generally better performance in training DenseNet models with different architectures for diagnosing COVID-19, which was validated using two publicly available COVID-19 lung CT image datasets. On one dataset, the area under the curve (AUC) values were 0.7799 (weighted) vs. 0.7391 (unweighted), 0.7812 (weighted) vs. 0.7410 (unweighted), 0.7780 (weighted) vs. 0.7399 (unweighted), and 0.7045 (weighted) vs. 0.6910 (unweighted) for DenseNet121, DenseNet169, DenseNet201, and the baseline model VGG19, respectively. The same trend was observed using another independent dataset. These results demonstrate the value of using this existing state-of-the-art algorithm for image anomaly detection. Furthermore, the end-to-end model structure has the potential to work as a rapid tool for clinical imaging-based diagnosis.
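For readers unfamiliar with the matrix profile, the brute-force sketch below illustrates the one-dimensional case: each window of a flattened signal is compared with every non-overlapping window, and a large nearest-neighbor distance marks an anomaly. This is a naive quadratic-time illustration, not the optimized algorithm or the authors' code; the exclusion-zone width (half the window length) and the synthetic signal are assumptions made for the example.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]


def matrix_profile(series, m):
    """Distance from each length-m window to its nearest non-trivial match."""
    n_sub = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n_sub)]
    exclusion = max(1, m // 2)  # skip near-overlapping windows (assumed width)
    profile = []
    for i in range(n_sub):
        best = math.inf
        for j in range(n_sub):
            if abs(i - j) < exclusion:
                continue
            best = min(best, math.dist(subs[i], subs[j]))
        profile.append(best)
    return profile


# Periodic signal with a single injected spike at index 13.
signal = [0.0, 1.0, 0.0, -1.0] * 6
signal[13] = 5.0
profile = matrix_profile(signal, 4)
anomaly_start = max(range(len(profile)), key=profile.__getitem__)
```

Windows that repeat elsewhere in the signal have a profile value of zero, while the windows covering the spike have no close match and therefore stand out.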
We develop a statistical tool, SNVer, for calling common and rare variants in the analysis of pooled or individual next-generation sequencing (NGS) data. We formulate variant calling as a hypothesis testing problem and employ a binomial-binomial model to test the significance of the observed allele frequency against sequencing error. SNVer reports a single overall P-value for evaluating the significance of a candidate locus being a variant, based on which multiplicity control can be obtained. This is particularly desirable because tens of thousands of loci are simultaneously examined in typical NGS experiments. Each user can choose the false-positive error rate threshold he or she considers appropriate, instead of just the dichotomous decisions of whether to 'accept or reject the candidates' provided by most existing methods. We use both simulated data and real data to demonstrate the superior performance of our program in comparison with existing methods. SNVer runs very fast and can complete testing 300 K loci within an hour. This excellent scalability makes it feasible to analyze whole-exome sequencing data, or even whole-genome sequencing data using a high-performance computing cluster. SNVer is freely available at http://snver.sourceforge.net/.
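The hypothesis-testing view of variant calling can be illustrated with a deliberately simplified case: a single individual sample, where the number of reads carrying the alternate allele is tested against a fixed sequencing error rate with a one-sided binomial test. This is only a sketch of the general idea; it is not the binomial-binomial model used by SNVer for pooled data, and the function name, read counts, and error rate are hypothetical.

```python
from math import comb


def binomial_tail_pvalue(alt_reads, depth, error_rate):
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate)."""
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(alt_reads, depth + 1)
    )


# 8 alternate-allele reads out of 40 at an assumed 1% per-base error rate.
p_value = binomial_tail_pvalue(8, 40, 0.01)

# With one P-value per locus, a multiplicity-adjusted threshold can be applied,
# for example a Bonferroni cutoff over 300,000 tested loci.
is_called = p_value < 0.05 / 300_000
```

Reporting the P-value itself, rather than a yes/no call, is what lets each user pick an error-rate threshold suited to the number of loci tested.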
Background:
Traditional therapeutics targeting Alzheimer’s disease (AD)-related subpathologies have so far proved ineffective. Drug repurposing, a more effective strategy that aims to find new indications for existing drugs against other diseases, offers benefits in AD drug development. In this study, we aim to identify potential anti-AD agents through enrichment analysis of drug-induced transcriptional profiles of pathways based on AD-associated risk genes identified from genome-wide association analyses (GWAS) and single-cell transcriptomic studies.
Methods:
We systematically constructed four gene lists (972 risk genes) from GWAS and single-cell transcriptomic studies and performed functional and gene overlap analyses using the Enrichr tool. We then used Gene2Drug, a comprehensive drug repurposing tool that combines drug-induced transcriptional responses with the associated pathways, to compute candidate drugs from each gene list. Prioritized potential candidates (eight drugs) were further assessed through literature review.
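Gene overlap and enrichment analyses of this kind usually ask whether a gene list shares more genes with a pathway than expected by chance. The sketch below computes a one-sided hypergeometric P-value, the calculation underlying Fisher's exact test; that this matches the exact statistic used in the study is an assumption, and the universe size, pathway size, and overlap in the example are hypothetical numbers chosen for illustration.

```python
from math import comb


def overlap_pvalue(universe, set_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn from the universe
    and set_size of the universe genes belong to the pathway."""
    upper = min(set_size, list_size)
    favourable = sum(
        comb(set_size, k) * comb(universe - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe, list_size)


# Hypothetical numbers: a 200-gene list, a 150-gene pathway,
# a 20,000-gene background, and 12 shared genes.
p_enrichment = overlap_pvalue(20_000, 150, 200, 12)
```

When many pathways are tested, these P-values would normally be corrected for multiple testing before candidates are prioritized.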
Results:
The genomic-based gene lists contain late-onset AD-associated genes (BIN1, ABCA7, APOE, CLU, and PICALM) and clinical AD drug targets (TREM2, CD33, CHRNA2, PRSS8, ACE, TKT, APP, and GABRA1). Our analysis identified eight AD candidate drugs (ellipticine, alsterpaullone, tomelukast, ginkgolide A, chrysin, ouabain, sulindac sulfide and lorglumide), four of which (alsterpaullone, ginkgolide A, chrysin and ouabain) have shown repurposing potential for AD, supported by preclinical evidence and moderate toxicity profiles reported in the literature. These findings support the value of pathway-based prioritization based on the disease risk genes from GWAS and scRNA-seq data analysis.
Conclusion:
Our analysis strategy identified some potential drug candidates for AD. Although the drugs still need further experimental validation, the approach may be applied to repurpose drugs for other neurological disorders using their genomic information identified from large-scale genomic studies.
Recent studies showed that somatic cancer mutations target genes that are in specific signaling and cellular pathways. However, in each patient only a few of the pathway genes are mutated. Current approaches consider only existing pathways and ignore the topology of the pathways. For this reason, new efforts have been focused on identifying significantly mutated subnetworks and associating them with cancer characteristics. We applied two well-established network analysis approaches to identify significantly mutated subnetworks in the breast cancer genome. We took network topology into account when measuring the mutation similarity of a gene pair, which allowed us to infer the significantly mutated subnetworks. Our goals are to evaluate whether the identified subnetworks can be used as biomarkers for predicting breast cancer patient survival and to provide the potential mechanisms of the pathways enriched in the subnetworks, with the aim of improving breast cancer treatment. Using the copy number alteration (CNA) datasets from the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) study, we identified a significantly mutated yet clinically and functionally relevant subnetwork using two graph-based clustering algorithms. The mutational pattern of the subnetwork is significantly associated with breast cancer survival. The genes in the subnetwork are significantly enriched in the retinol metabolism KEGG pathway. Our results show that treatment with retinoids may be a potential personalized therapy for breast cancer patients, since the CNA patterns of the patients can indicate whether the retinoid pathway is altered. We also showed that applying multiple bioinformatics algorithms at the same time has the potential to identify new network-based biomarkers, which may be useful for stratifying cancer patients for choosing optimal treatments.
Artificial intelligence-based unsupervised deep learning (DL) is widely used to mine multimodal big data. However, there are few applications of this technology to cancer genomics. We aim to develop DL models to extract deep features from the breast cancer gene expression data and copy number alteration (CNA) data separately and jointly. We hypothesize that the deep features are associated with patients' clinical characteristics and outcomes. Two unsupervised denoising autoencoders (DAs) were developed to extract deep features from TCGA (The Cancer Genome Atlas) breast cancer gene expression and CNA data separately and jointly. A heat map was used to view and cluster patients into subgroups based on these DL features. Fisher's exact test and Pearson's chi-square test were applied to test the associations between patient groups and clinical information. Survival differences between the groups were evaluated by Kaplan-Meier (KM) curves. Associations between each of the features and patients' overall survival were assessed using Cox's proportional hazards (COX-PH) model, and a risk score for each feature set from the different omics data sets was generated from the survival regression coefficients. The risk scores for each feature set were binarized into high- and low-risk patient groups to evaluate survival differences using KM curves. Furthermore, the risk scores were traced back to their gene-level DA weights so that the three gene lists for each of the genomic data points were generated to perform gene set enrichment analysis. Patients were clustered into two groups based on concatenated features from the gene expression and CNA data, and these two groups showed different overall survival rates (p-value = 0.049) and different ER (estrogen receptor) statuses (p-value = 0.002, OR (odds ratio) = 0.626). All the risk scores from the gene expression and CNA data and their concatenated one were significantly associated with breast cancer survival. Patients in the high-risk group had significantly worse outcomes (p-values ≤ 0.0023). The concatenated risk score was enriched for the AMP-activated protein kinase (AMPK) signaling pathway, the regulation of DNA-templated transcription, the regulation of nucleic acid-templated transcription, the regulation of apoptotic process, the positive regulation of gene expression, the positive regulation of cell proliferation, heart morphogenesis, and the regulation of cellular macromolecule biosynthetic process, with FDR (false discovery rate) less than 0.05. We confirmed that DAs can effectively extract meaningful genomic features from genomic data and that concatenating multiple data sources can improve the significance of the features associated with breast cancer patients' clinical characteristics and outcomes.
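The risk-score step described above can be sketched as a weighted sum of deep features followed by a split into two groups. The example below is a minimal illustration under stated assumptions: the coefficients stand in for fitted Cox regression coefficients, the median is assumed as the cutoff (the abstract does not state which threshold was used), and all names and numbers are hypothetical.

```python
from statistics import median


def risk_scores(features, coefficients):
    """Linear risk score per patient: sum of coefficient * feature value."""
    return [sum(b * x for b, x in zip(coefficients, row)) for row in features]


def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median score."""
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]


# Four patients, two deep features each; coefficients are placeholders.
deep_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
cox_coefficients = [0.5, -0.2]

scores = risk_scores(deep_features, cox_coefficients)
groups = split_by_median(scores)
```

The two resulting groups are what a Kaplan-Meier comparison would then be run on to check whether the score separates patients by survival.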
Converting molecules into computer-interpretable features with rich molecular information is a core problem of data-driven machine learning applications in chemical and drug-related tasks. Generally speaking, there are global and local features to represent a given molecule. As most algorithms have been developed based on one type of feature, a remaining bottleneck is to combine both feature sets for advanced molecule-based machine learning analysis. Here, we explored a novel analytical framework to make embeddings of the molecular features and apply them in the clustering of a large number of small molecules.
In this novel framework, we first introduced a principal component analysis method encoding the molecule-specific atom and bond information. We then used a variational autoencoder (AE)-based method to make embeddings of the global chemical properties and the local atom and bond features. Next, using the embeddings from the encoded local and global features, we implemented and compared several unsupervised clustering algorithms to group the molecule-specific embeddings. The number of clusters was treated as a hyper-parameter and determined by the Silhouette method. Finally, we evaluated the corresponding results using three internal indices. Applying the analysis framework to a large chemical library of more than 47,000 molecules, we successfully identified 50 molecular clusters using the K-means method with 32 embeddings based on the AE method. We visualized the clustering result via t-SNE for the overall distribution of molecules and the similarity maps for the structural analysis of randomly selected cluster-specific molecules.
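Choosing the number of clusters with the Silhouette method amounts to computing the mean silhouette coefficient for each candidate clustering and keeping the one that scores highest. The sketch below implements the coefficient directly for small inputs; it is illustrative only, assumes at least two clusters and Euclidean distance, and uses toy two-dimensional points rather than molecular embeddings.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
compact_split = silhouette_score(toy_points, [0, 0, 1, 1])
mixed_split = silhouette_score(toy_points, [0, 1, 0, 1])
```

A labeling that groups nearby points scores close to 1, while one that mixes distant points scores below 0, which is why the score can rank candidate cluster counts.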
This study developed a novel analytical framework that comprises a feature engineering scheme for molecule-specific atomic and bonding features and a deep learning-based embedding strategy for different molecular features. By applying the identified embeddings, we show their usefulness for clustering a large molecule dataset. Our novel analytic algorithms can be applied to any virtual library of chemical compounds with diverse molecular structures. Hence, these tools have the potential of optimizing drug discovery, as they can decrease the number of compounds to be screened in any drug screening campaign.