Abstract
Motivation: The inference of cellular compositions from bulk and spatial transcriptomics data increasingly complements data analyses. Multiple computational approaches have been suggested and, recently, machine learning techniques have been developed to systematically improve estimates. Such approaches make it possible to infer additional, less abundant cell types. However, they rely on training data that do not capture the full biological diversity encountered in transcriptomics analyses; data can contain cellular contributions not seen in the training data and, as such, analyses can be biased or blurred. Thus, computational approaches have to deal with unknown, hidden contributions. Moreover, most methods are based on cellular archetypes that serve as a reference; e.g. a generic T-cell profile is used to infer the proportion of T-cells. It is well known that cells adapt their molecular phenotype to the environment and that pre-specified cell archetypes can distort the inference of cellular compositions.
Results: We propose Adaptive Digital Tissue Deconvolution (ADTD) to estimate cellular proportions of pre-selected cell types together with possibly unknown and hidden background contributions. Moreover, ADTD adapts prototypic reference profiles to the molecular environment of the cells, which further resolves cell-type-specific gene regulation from bulk transcriptomics data. We verify this in simulation studies and demonstrate that ADTD improves on existing approaches in estimating cellular compositions. In an application to bulk transcriptomics data from breast cancer patients, we demonstrate that ADTD provides insights into cell-type-specific molecular differences between breast cancer subtypes.
Availability and implementation: A Python implementation of ADTD and a tutorial are available at GitLab and Zenodo (doi:10.5281/zenodo.7548362).
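The reference-based deconvolution setup that ADTD builds on can be sketched as a non-negative regression of a bulk expression profile onto prototypic reference profiles. The snippet below is a minimal illustration of that baseline idea on simulated data, not ADTD itself (which additionally models a hidden background contribution and adapts the reference profiles to the environment); the simulated matrices and all variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_genes, n_types = 200, 3

# Hypothetical reference profiles: one expression column per cell type
X = rng.gamma(2.0, 1.0, size=(n_genes, n_types))

# Simulate a bulk sample as a mixture of the reference profiles plus noise
c_true = np.array([0.6, 0.3, 0.1])
y = X @ c_true + rng.normal(0.0, 0.01, size=n_genes)

# Non-negative least squares recovers the mixing coefficients
c_hat, _ = nnls(X, y)
c_hat /= c_hat.sum()  # renormalize to proportions
```

ADTD extends this basic setup with a background term for unseen cell types and environment-adapted reference profiles, but the proportion estimate above is the common starting point of reference-based methods.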
Abstract
Summary: Drug response is conventionally measured at the cell level, often quantified by metrics like IC50. However, to gain a deeper understanding of drug response, cellular outcomes need to be understood in terms of pathway perturbation. This perspective highlights a challenge posed by the gap between two widely used large-scale databases, LINCS L1000 and GDSC, which measure drug response at different levels: L1000 captures information at the gene expression level, while GDSC operates at the cell line level. Our study aims to bridge this gap by integrating the two databases through transfer learning, focusing on condition-specific perturbations in gene interactions from L1000 to interpret drug response at both the gene and cell levels in GDSC. This transfer learning strategy involves pretraining on the transcriptomic-level L1000 dataset, followed by parameter-frozen fine-tuning on cell line-level drug response. Our novel condition-specific gene–gene attention (CSG2A) mechanism dynamically learns gene interactions specific to input conditions, guided by both data and biological network priors. The CSG2A network, equipped with this transfer learning strategy, achieves state-of-the-art performance in cell line-level drug response prediction. In two case studies, well-known mechanisms of drugs are well represented in both the learned gene–gene attention and the predicted transcriptomic profiles, supporting the model's interpretability and biological relevance. Furthermore, the model's unique capacity to capture drug response in terms of both pathway perturbation and cell viability extends predictions to the patient level using TCGA data, demonstrating the expressive power obtained from both gene and cell levels.
Availability and implementation: The source code for the CSG2A network is available at https://github.com/eugenebang/CSG2A.
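The general idea of condition-specific gene–gene attention can be illustrated in a few lines of NumPy: gene embeddings are modulated by a condition embedding before the query/key scores are computed, and a network prior masks gene pairs with no known interaction. This is a hedged sketch of the concept only, not the authors' CSG2A implementation; every embedding, weight matrix, and the all-ones stand-in prior are invented for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_genes, d = 6, 8
gene_emb = rng.normal(size=(n_genes, d))   # hypothetical per-gene embeddings
cond = rng.normal(size=(d,))               # embedding of the input condition (e.g. drug + cell line)

Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Q = (gene_emb * cond) @ Wq                 # the condition modulates queries ...
K = (gene_emb * cond) @ Wk                 # ... and keys, making attention condition-specific
scores = Q @ K.T / np.sqrt(d)

# Stand-in for a biological network prior (0/1 adjacency); here fully connected
prior = np.ones((n_genes, n_genes))
A = softmax(np.where(prior > 0, scores, -np.inf))  # mask non-interacting pairs, then normalize
```

Because `cond` enters the query/key projections, a different drug or cell line yields a different attention matrix over the same genes, which is the behavior the abstract attributes to CSG2A.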
Abstract
Motivation: Multimodal profiling strategies promise to produce more informative insights into biomedical cohorts by integrating the information each modality contributes. To perform this integration, however, novel analytical strategies are needed. Multimodal profiling often comes at the expense of lower sample numbers, which can challenge methods to uncover shared signals across a cohort. Factor analysis approaches are commonly used for the analysis of high-dimensional data in molecular biology; however, they typically do not yield directly interpretable representations, whereas many research questions center on the analysis of pathways associated with specific observations.
Results: We develop PathFA, a novel approach for multimodal factor analysis over the space of pathways. PathFA produces integrative and interpretable views across multimodal profiling technologies, which allow for the derivation of concrete hypotheses. It combines a pathway-learning approach with integrative multimodal capability under a Bayesian procedure that is efficient, hyperparameter-free, and able to automatically infer observation noise from the data. We demonstrate strong performance on small sample sizes within our simulation framework and on matched proteomics and transcriptomics profiles from real tumor samples taken from the Swiss Tumor Profiler consortium. On a subcohort of melanoma patients, PathFA recovers pathway activity that has been independently associated with poor outcome. We further demonstrate the ability of this approach to identify pathways associated with the presence of specific cell types as well as tumor heterogeneity. Our results show that we capture known biology, making PathFA well suited for analyzing multimodal sample cohorts.
Availability and implementation: The tool is implemented in Python and available at https://github.com/ratschlab/path-fa.
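Although PathFA's Bayesian procedure is more involved, the basic idea of factor analysis over a pathway space can be sketched with a gene-to-pathway membership matrix: observed profiles are modeled as latent pathway activities mapped through memberships, and the activities are recovered by regression. The simulation below is purely illustrative (hypothetical membership matrix, activities, and noise model), not the PathFA algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_genes, n_pathways = 20, 100, 4

# Hypothetical binary gene -> pathway membership matrix
M = (rng.random((n_genes, n_pathways)) < 0.15).astype(float)

# Latent pathway activities generate the observed gene-level profiles
P_true = rng.normal(size=(n_samples, n_pathways))
X = P_true @ M.T + 0.05 * rng.normal(size=(n_samples, n_genes))

# Recover pathway activities by least squares against the membership matrix
P_hat, *_ = np.linalg.lstsq(M, X.T, rcond=None)
P_hat = P_hat.T
```

Working in pathway coordinates is what makes the resulting factors directly interpretable: each recovered column corresponds to a named pathway rather than an anonymous latent dimension.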
Abstract
Motivation: Analysis of time series transcriptomics data from clinical trials is challenging. Such studies usually profile very few time points from several individuals with varying response patterns and dynamics. Current methods for these datasets are mainly based on linear, global orderings using visit times, which do not account for the varying response rates and subgroups within a patient cohort.
Results: We developed a new method that utilizes multi-commodity flow algorithms for trajectory inference in large-scale clinical studies. Recovered trajectories satisfy individual-based timing restrictions while integrating data from multiple patients. Testing the method on multiple drug datasets demonstrated improved performance compared with prior approaches suggested for this task, while identifying novel disease subtypes that correspond to heterogeneous patient response patterns.
Availability and implementation: The source code and instructions to download the data have been deposited on GitHub at https://github.com/euxhenh/Truffle.
Abstract
Motivation: Predicting cancer drug response requires a comprehensive assessment of the many mutations present across a tumor genome. While current drug response models generally use a binary mutated/unmutated indicator for each gene, not all mutations in a gene are equivalent.
Results: Here, we construct and evaluate a series of predictive models based on leading methods for quantitative mutation scoring. Such methods include VEST4 and CADD, which score the impact of a mutation on gene function, and CHASMplus, which scores the likelihood that a mutation drives cancer. The resulting predictive models capture cellular responses to dabrafenib, which targets BRAF-V600 mutations, whereas models based on binary mutation status do not. Performance improvements generalize to other drugs, extending genetic indications for PIK3CA, ERBB2, EGFR, PARP1, and ABL1 inhibitors. Introducing quantitative mutation features into drug response models increases performance and mechanistic understanding.
Availability and implementation: Code and example datasets are available at https://github.com/pgwall/qms.
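The abstract's central claim, that a quantitative impact score is more informative than a binary mutated/unmutated flag, can be illustrated with a toy simulation in which drug response tracks a VEST/CADD-like impact score rather than mere mutation presence. The data-generating process and the simple least-squares fits below are hypothetical stand-ins for the paper's models.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Simulate: half the samples carry a mutation, with varying impact scores in [0, 1]
mutated = rng.random(n) < 0.5
score = np.where(mutated, rng.random(n), 0.0)   # quantitative impact (VEST/CADD-like)
binary = mutated.astype(float)                  # conventional mutated/unmutated flag

# Drug response depends on how damaging the mutation is, not just its presence
response = -2.0 * score + 0.1 * rng.normal(size=n)

def r2(x, y):
    """R^2 of a univariate linear fit of y on x (with intercept)."""
    X = np.column_stack([x, np.ones_like(x)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_score_feat = r2(score, response)
r2_binary_feat = r2(binary, response)
```

In this setup the binary feature can only separate mutated from unmutated samples, so it leaves the within-group variation unexplained, while the quantitative score recovers nearly all of the signal; the same intuition underlies the dabrafenib/BRAF-V600 result described above.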
Abstract
Gaining knowledge and actionable insights from complex, high-dimensional and heterogeneous biomedical data remains a key challenge in transforming health care. Various types of data have been emerging in modern biomedical research, including electronic health records, imaging, -omics, sensor data and text, which are complex, heterogeneous, poorly annotated and generally unstructured. Traditional data mining and statistical learning approaches typically need to first perform feature engineering to obtain effective and more robust features from those data, and then build prediction or clustering models on top of them. Both steps are challenging when the data are complicated and sufficient domain knowledge is lacking. The latest advances in deep learning technologies provide new effective paradigms to obtain end-to-end learning models from complex data. In this article, we review the recent literature on applying deep learning technologies to advance the health care domain. Based on the analyzed work, we suggest that deep learning approaches could be the vehicle for translating big biomedical data into improved human health. However, we also note limitations and needs for improved methods development and applications, especially in terms of ease of understanding for domain experts and citizen scientists. We discuss such challenges and suggest developing holistic and meaningful interpretable architectures to bridge deep learning models and human interpretability.
Machine learning has become a pivotal tool for many projects in computational biology, bioinformatics, and health informatics. Nevertheless, beginners and biomedical researchers often do not have enough experience to run a data mining project effectively, and can therefore follow incorrect practices that lead to common mistakes or over-optimistic results. With this review, we present ten quick tips for taking advantage of machine learning in any computational biology context, by avoiding some common errors that we have observed hundreds of times across multiple bioinformatics projects. We believe our ten suggestions can strongly help any machine learning practitioner carry out a successful project in computational biology and related sciences.
The birth of wireless communication systems nearly a century ago has transformed and redefined the way humans communicate and interact. This transformation, which has opened boundaries between societies and eliminated cultural barriers, has evolved from the original paradigm of wireless electromagnetic communication systems. That paradigm has experienced numerous evolutionary developments that have brought along not only seamless connectivity for human interactions but also communication between machines and devices.
Advances in biological and medical technologies have been providing us with explosive volumes of biological and physiological data, such as medical images, electroencephalography, and genomic and protein sequences. Learning from these data facilitates the understanding of human health and disease. Developed from artificial neural networks, deep learning-based algorithms show great promise in extracting features and learning patterns from complex data. The aim of this paper is to provide an overview of deep learning techniques and some of the state-of-the-art applications in the biomedical field. We first introduce the development of artificial neural networks and deep learning. We then describe two main components of deep learning, i.e., deep learning architectures and model optimization. Subsequently, some examples are demonstrated for deep learning applications, including medical image classification, genomic sequence analysis, and protein structure classification and prediction. Finally, we offer our perspectives on future directions in the field of deep learning.
The COVID-19 outbreak has put the whole world in an unprecedented, difficult situation, bringing life around the world to a frightening halt and claiming thousands of lives. With COVID-19 having spread to 212 countries and territories, and the numbers of infected cases and deaths mounting to 5,212,172 and 334,915, respectively (as of 22 May 2020), it remains a real threat to the public health system. This paper renders a response to combat the virus through Artificial Intelligence (AI). Several deep learning (DL) methods are illustrated toward this goal, including Generative Adversarial Networks (GANs), Extreme Learning Machine (ELM), and Long Short-Term Memory (LSTM). It delineates an integrated bioinformatics approach in which different aspects of information from a continuum of structured and unstructured data sources are combined to form user-friendly platforms for physicians and researchers. The main advantage of these AI-based platforms is to accelerate the diagnosis and treatment of COVID-19. The most recent related publications and medical reports were investigated with the purpose of choosing inputs and targets of the network that could facilitate reaching a reliable Artificial Neural Network-based tool for challenges associated with COVID-19. Furthermore, there are specific inputs for each platform, including various forms of data, such as clinical data and medical imaging, which can improve the performance of the introduced approaches toward the best responses in practical applications.