The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for the derivation of marker combinations and their evaluation, the underlying methodology often suffers from the problem that different optimization criteria are mixed during the feature selection, estimation and evaluation steps. This might result in marker combinations that are suboptimal regarding the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, which is a non-parametric measure to quantify the discriminatory power of a prediction rule. Specifically, we propose a gradient boosting algorithm that results in linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and in two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to a higher discriminatory power than traditional approaches for the derivation of gene signatures.
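As an illustration of the evaluation criterion (a generic sketch, not the authors' implementation), the concordance index and its sigmoid-smoothed version can be written in a few lines; the smoothing parameter `sigma` and the pair-handling conventions here are simplifying assumptions:

```python
import math

def c_index(time, event, risk, sigma=None):
    """Concordance index for right-censored data. A pair (i, j) is
    comparable if subject i had an event before subject j's observed
    time; with sigma set, the hard indicator is replaced by a sigmoid,
    giving the smoothed (differentiable) criterion."""
    num, den = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(len(time)):
            if time[j] > time[i]:
                den += 1
                diff = risk[i] - risk[j]  # subject i should score higher
                if sigma is None:
                    num += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
                else:
                    num += 1.0 / (1.0 + math.exp(-diff / sigma))
    return num / den

# perfectly concordant toy data: higher risk score, earlier event
times, events, risks = [1.0, 2.0, 3.0, 4.0], [1, 1, 1, 0], [4.0, 3.0, 2.0, 1.0]
print(c_index(times, events, risks))                       # -> 1.0
print(round(c_index(times, events, risks, sigma=1.0), 3))  # -> 0.818
```

The smoothed version is what makes gradient-based optimization possible: unlike the step function, the sigmoid has a usable derivative with respect to the linear marker combination.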
Statistical boosting is a computational approach to select and estimate interpretable prediction models for high-dimensional biomedical data, leading to implicit regularization and variable selection when combined with early stopping. Traditionally, the set of base-learners is fixed for all iterations and consists of simple regression learners including only one predictor variable at a time. Furthermore, the number of iterations is typically tuned by optimizing the predictive performance, leading to models which often include unnecessarily large numbers of noise variables. We propose three consecutive extensions of classical component-wise gradient boosting. In the first extension, called Subspace Boosting (SubBoost), base-learners can consist of several variables, allowing for multivariable updates in a single iteration. To compensate for the larger flexibility, the ultimate selection of base-learners is based on information criteria leading to an automatic stopping of the algorithm. As the second extension, Random Subspace Boosting (RSubBoost) additionally includes a random preselection of base-learners in each iteration, enabling the scalability to high-dimensional data. In a third extension, called Adaptive Subspace Boosting (AdaSubBoost), an adaptive random preselection of base-learners is considered, focusing on base-learners which have proven to be predictive in previous iterations. Simulation results show that the multivariable updates in the three subspace algorithms are particularly beneficial in cases of high correlations among signal covariates. In several biomedical applications the proposed algorithms tend to yield sparser models than classical statistical boosting, while showing a very competitive predictive performance also compared to penalized regression approaches like the (relaxed) lasso and the elastic net.
The proposed randomized boosting approaches with multivariable base-learners are promising extensions of statistical boosting, particularly suited for highly-correlated and sparse high-dimensional settings. The incorporated selection of base-learners via information criteria induces automatic stopping of the algorithms, promoting sparser and more interpretable prediction models.
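The random-preselection idea behind RSubBoost can be sketched in Python. This toy version is heavily simplified: it uses only one-variable least-squares base-learners (no multivariable updates) and a fixed number of iterations instead of the information-criterion stopping described above; the function name and defaults are ours:

```python
import random

def rsub_boost(X, y, n_iter=100, nu=0.1, q=3, seed=0):
    """Toy gradient boosting with a random preselection of base-learners
    per iteration. X is a list of predictor columns, y the response."""
    rng = random.Random(seed)
    p = len(X)
    coef, offset = [0.0] * p, sum(y) / len(y)
    resid = [yi - offset for yi in y]
    for _ in range(n_iter):
        cand = rng.sample(range(p), min(q, p))  # the random subspace
        best = None
        for j in cand:  # fit each candidate base-learner to the residuals
            x = X[j]
            b = sum(v * r for v, r in zip(x, resid)) / sum(v * v for v in x)
            rss = sum((r - b * v) ** 2 for r, v in zip(resid, x))
            if best is None or rss < best[0]:
                best = (rss, j, b)
        _, j, b = best
        coef[j] += nu * b  # weak update of the winning base-learner
        resid = [r - nu * b * v for r, v in zip(resid, X[j])]
    return offset, coef

# noise-free toy data: y = 3 + 2*x0, with x1 and x2 as noise columns
off, coef = rsub_boost([[-1.5, -0.5, 0.5, 1.5],
                        [1.0, -1.0, 1.0, -1.0],
                        [0.5, -1.0, 1.0, -0.5]],
                       [0.0, 2.0, 4.0, 6.0], q=3)
print(off, [round(c, 3) for c in coef])  # -> 3.0 [2.0, 0.0, 0.0]
```

With `q` smaller than the number of predictors, only a random subset of base-learners competes in each iteration, which is what makes the scheme scalable to high-dimensional data; the adaptive variant would additionally upweight previously successful candidates.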
The infection fatality rate (IFR) of the Coronavirus Disease 2019 (COVID-19) is one of the most discussed figures in the context of this pandemic. In contrast to the case fatality rate (CFR), the IFR depends on the total number of infected individuals, not just on the number of confirmed cases. In order to estimate the IFR, several seroprevalence studies have been conducted or are currently under way.
Using German COVID-19 surveillance data and age-group specific IFR estimates from multiple international studies, this work investigates time-dependent variations in effective IFR over the course of the pandemic. Three different methods for estimating (effective) IFRs are presented: (a) population-averaged IFRs based on the assumption that the infection risk is independent of age and time, (b) effective IFRs based on the assumption that the age distribution of confirmed cases approximately reflects the age distribution of infected individuals, and (c) effective IFRs accounting for age- and time-dependent dark figures of infections.
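Method (b) amounts to a case-share-weighted average of age-specific IFRs. The sketch below illustrates the mechanism; all age-group IFRs and case shares are made-up illustrative values, not estimates from this study:

```python
# Illustrative age-specific IFRs (fractions, NOT values from the study)
ifr_by_age = {"0-34": 0.00004, "35-59": 0.002, "60-79": 0.028, "80+": 0.16}

def effective_ifr(case_shares, ifr=ifr_by_age):
    """Method (b): weight age-specific IFRs by the age distribution
    of confirmed cases at a given point in time."""
    assert abs(sum(case_shares.values()) - 1.0) < 1e-6
    return sum(case_shares[a] * ifr[a] for a in ifr)

# two hypothetical case-age distributions at different times
summer = {"0-34": 0.55, "35-59": 0.35, "60-79": 0.08, "80+": 0.02}
winter = {"0-34": 0.35, "35-59": 0.35, "60-79": 0.20, "80+": 0.10}
print(effective_ifr(summer), effective_ifr(winter))
```

Even with constant age-specific IFRs, a shift of cases towards older age groups raises the effective IFR, which is the mechanism behind the time-dependent variation discussed below; method (c) would additionally replace the case shares by estimated infection shares.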
Effective IFRs in Germany are estimated to vary over time, as the age distributions of confirmed cases and estimated infections changed during the course of the pandemic. In particular, during the first and second waves of infections in spring and autumn/winter 2020, there was a pronounced shift in the age distribution of confirmed cases towards older age groups, resulting in larger effective IFR estimates. The temporary increase in effective IFR during the first wave is estimated to be smaller, but still present, when adjusting for age- and time-dependent dark figures. A comparison of effective IFRs with observed CFRs indicates that a substantial fraction of the time-dependent variability in observed mortality can be explained by changes in the age distribution of infections. Furthermore, the gap between effective IFRs and observed CFRs vanishes after the first infection wave, while an increasing gap can be observed during the second wave.
The development of estimated effective IFR and observed CFR reflects the changing age distribution of infections over the course of the COVID-19 pandemic in Germany. Further research is warranted to obtain timely age-stratified IFR estimates, particularly in light of new variants of the virus.
Deep learning is currently the most successful machine learning technique in a wide range of application areas and has recently been applied successfully in drug discovery research to predict potential drug targets and to screen for active molecules. However, due to (1) the lack of large-scale studies, (2) the compound series bias that is characteristic of drug discovery datasets and (3) the hyperparameter selection bias that comes with the high number of potential deep learning architectures, it remains unclear whether deep learning can indeed outperform existing computational methods in drug discovery tasks. We therefore assessed the performance of several deep learning methods on a large-scale drug discovery dataset and compared the results with those of other machine learning and target prediction methods. To avoid potential biases from hyperparameter selection or compound series, we used a nested cluster-cross-validation strategy. We found (1) that deep learning methods significantly outperform all competing methods and (2) that the predictive performance of deep learning is in many cases comparable to that of tests performed in wet labs (i.e., in vitro assays).
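The core ingredient of cluster-cross-validation is that compounds from the same series never straddle the train/test boundary. A minimal sketch of such a split (function name and fold assignment rule are ours, not from the study):

```python
def cluster_folds(cluster_ids, k):
    """Partition sample indices into k folds such that all samples from
    the same compound cluster land in the same fold, so no series is
    split between training and test data."""
    clusters = sorted(set(cluster_ids))
    fold_of = {c: i % k for i, c in enumerate(clusters)}  # round-robin
    folds = [[] for _ in range(k)]
    for idx, c in enumerate(cluster_ids):
        folds[fold_of[c]].append(idx)
    return folds

# ten samples from six compound clusters, split into three folds
folds = cluster_folds(["s1", "s1", "s2", "s3", "s3", "s3",
                       "s4", "s5", "s5", "s6"], 3)
print(folds)  # -> [[0, 1, 6], [2, 7, 8], [3, 4, 5, 9]]
```

In the nested variant, the same construction is applied again within each outer training set to select hyperparameters, so that the outer folds estimate performance on clusters that were never seen during either training or tuning.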
We provide a detailed hands-on tutorial for the R add-on package mboost. The package implements boosting for optimizing general risk functions, utilizing component-wise (penalized) least squares estimates as base-learners for fitting various kinds of generalized linear and generalized additive models to potentially high-dimensional data. We give a theoretical background and demonstrate how mboost can be used to fit interpretable models of different complexity. As an example, we use mboost to predict body fat based on anthropometric measurements throughout the tutorial.
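The fitting-and-tuning workflow described here can be sketched language-agnostically. The Python stand-in below mimics component-wise least-squares boosting with the stopping iteration chosen by validation risk (roughly the role of mboost's cvrisk/mstop machinery); it is not the package itself, and all names are ours:

```python
def cw_boost(Xtr, ytr, Xval, yval, n_iter=150, nu=0.1):
    """Minimal component-wise L2-boosting; returns the stopping
    iteration minimizing validation risk, plus the coefficients."""
    p, off = len(Xtr), sum(ytr) / len(ytr)
    coef = [0.0] * p
    resid = [yi - off for yi in ytr]
    risks = []
    for _ in range(n_iter):
        fits = []
        for x in Xtr:  # fit every one-variable base-learner to residuals
            b = sum(v * r for v, r in zip(x, resid)) / sum(v * v for v in x)
            fits.append((sum((r - b * v) ** 2 for r, v in zip(resid, x)), b))
        j = min(range(p), key=lambda k: fits[k][0])  # best-fitting learner
        coef[j] += nu * fits[j][1]
        resid = [r - nu * fits[j][1] * v for r, v in zip(resid, Xtr[j])]
        pred = [off + sum(c * Xval[k][i] for k, c in enumerate(coef))
                for i in range(len(yval))]
        risks.append(sum((yi - pi) ** 2 for yi, pi in zip(yval, pred)))
    mstop = min(range(n_iter), key=lambda i: risks[i]) + 1
    return mstop, coef

# toy "body fat" setting: the response depends only on the first predictor
mstop, coef = cw_boost([[-2.0, -1.0, 0.0, 1.0, 2.0],
                        [1.0, -1.0, 1.0, -1.0, 1.0]],
                       [-1.0, -0.5, 0.0, 0.5, 1.0],
                       [[-1.0, 1.0], [1.0, -1.0]], [-0.5, 0.5])
print(mstop, [round(c, 3) for c in coef])
```

Because only the best-fitting base-learner is updated per iteration, early stopping yields implicit variable selection: the noise predictor keeps a coefficient of exactly zero here.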
Generalized additive models for location, scale and shape are a flexible class of regression models that allow multiple parameters of a distribution, such as the mean and the standard deviation, to be modelled simultaneously. With the R package gamboostLSS, we provide a boosting method to fit these models. Variable selection and model choice are naturally available within this regularized regression framework. To introduce and illustrate the R package gamboostLSS and its infrastructure, we use a data set on stunted growth in India. In addition to the specification and application of the model itself, we present a variety of convenience functions, including methods for tuning parameter selection, prediction and visualization of results. The package gamboostLSS is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=gamboostLSS.
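The idea of boosting several distribution parameters at once can be illustrated with a much-simplified Python analogue: cyclic gradient updates for the mean and log-scale of a Gaussian, using intercept-only base-learners. This is a conceptual sketch of the gamboostLSS principle, not the package itself:

```python
import math

def boost_lss(y, n_iter=2000, nu=0.1):
    """Toy cyclic boosting for the two parameters of a Gaussian,
    mu and s = log(sigma). In each iteration the negative gradient of
    the negative log-likelihood w.r.t. each parameter is fitted by a
    constant (intercept-only base-learner) and added with step nu."""
    n = len(y)
    mu, s = 0.0, 0.0  # location and log-scale
    for _ in range(n_iter):
        sigma2 = math.exp(2 * s)
        # gradient step for the location parameter
        mu += nu * sum(yi - mu for yi in y) / (n * sigma2)
        # gradient step for the log-scale parameter
        s += nu * (sum((yi - mu) ** 2 for yi in y) / (n * sigma2) - 1.0)
    return mu, math.exp(s)

mu, sigma = boost_lss([1.0, 2.0, 3.0, 4.0, 5.0])
print(round(mu, 3), round(sigma, 3))  # -> 3.0 1.414
```

At convergence this recovers the maximum-likelihood mean and standard deviation; in the real framework each parameter gets its own set of (possibly smooth) covariate base-learners, so that both location and scale can depend on predictors.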
We consider a retrospective modelling approach for estimating effective reproduction numbers based on death counts during the first year of the COVID-19 pandemic in Germany. The proposed Bayesian hierarchical model incorporates splines to estimate reproduction numbers flexibly over time while adjusting for varying effective infection fatality rates. The approach also provides estimates of dark figures of undetected infections. Results for Germany illustrate that our estimates based on death counts are often similar to classical estimates based on confirmed cases; considering death counts, however, allows us to disentangle the effects of adapted testing policies from transmission dynamics. In particular, during the second wave of infections, classical estimates suggest a flattening infection curve following the "lockdown light" in November 2020, while our results indicate that infections continued to rise until the "second lockdown" in December 2020. This observation is associated with the more stringent testing criteria introduced concurrently with the "lockdown light", which is reflected in the subsequently increasing dark figures of infections estimated by our model. In light of progressive vaccinations, shifting the focus from modelling confirmed cases to reported deaths, with the possibility of incorporating effective infection fatality rates, might be of increasing relevance for the future surveillance of the pandemic.
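The intuition of estimating transmission from deaths can be sketched crudely: back-shift deaths by the infection-to-death delay, rescale by the IFR, and compare infections one generation apart. This toy version deliberately ignores everything that makes the actual model work (delay distributions, splines, priors, time-varying IFRs); names and numbers are illustrative:

```python
def infections_from_deaths(deaths, ifr, delay):
    """Back-shift death counts by a fixed infection-to-death delay and
    scale by 1/IFR to reconstruct the infection curve -- a crude
    stand-in for the model's deconvolution step."""
    return [d / ifr for d in deaths[delay:]]

def r_naive(infections, gen_time):
    """Naive reproduction number: infections now vs. one generation ago."""
    return [infections[t] / infections[t - gen_time]
            for t in range(gen_time, len(infections))]

# weekly death counts doubling -> reconstructed infections double as well
inf = infections_from_deaths([10.0, 20.0, 40.0, 80.0], ifr=0.01, delay=1)
print(inf, r_naive(inf, gen_time=1))  # -> [2000.0, 4000.0, 8000.0] [2.0, 2.0]
```

The key property carries over to the full model: because the reconstruction uses deaths rather than confirmed cases, a change in testing policy alters the ratio of cases to reconstructed infections (the dark figure) without distorting the estimated reproduction number.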
In genetics, gene sets are grouped into collections according to their biological function. This often leads to high-dimensional, overlapping, and redundant families of sets, precluding a straightforward interpretation of their biological meaning. In data mining, it is often argued that techniques for reducing the dimensionality of data can increase the manoeuvrability and consequently the interpretability of large data sets. In recent years, moreover, the machine learning and bioinformatics communities have become increasingly conscious of the importance of understanding data and of interpretable models. On the one hand, there exist techniques that aggregate overlapping gene sets to create larger pathways. While these methods can partly address the problem of the collections' large size, modifying biological pathways is hardly justifiable in this biological context. On the other hand, the representation methods proposed so far to increase the interpretability of collections of gene sets have proved insufficient. Inspired by this bioinformatics context, we propose a method to rank sets within a family of sets based on the distribution of the singletons and on set size. We obtain importance scores for the sets by computing Shapley values; by making use of microarray games, we avoid the typical exponential computational complexity. Moreover, we address the challenge of constructing redundancy-aware rankings, where redundancy is a quantity proportional to the size of the intersections among the sets in the collection. We use the obtained rankings to reduce the dimension of the families, which lowers the redundancy among sets while still preserving a high coverage of their elements.
We finally evaluate our approach on collections of gene sets and apply Gene Set Enrichment Analysis techniques to the reduced collections: as expected, the unsupervised nature of the proposed rankings leads to only negligible differences in the number of significant gene sets for specific phenotypic traits, while the number of statistical tests performed can be drastically reduced. The proposed rankings thus have practical utility in bioinformatics for increasing the interpretability of collections of gene sets, and they represent a step towards incorporating redundancy-awareness into Shapley value computations.
Purpose
Postoperative delirium (POD) is an often unrecognized adverse event in older people after surgery. The aim of this subgroup analysis of the PRe-Operative Prediction of postoperative DElirium by appropriate SCreening (PROPDESC) trial in patients aged 70 years and older was to identify preoperative risk factors and the impact of POD on length of stay (LOS) in the intensive care unit (ICU) and in hospital.
Methods
Of the 1097 patients recruited at a German university hospital (from September 2018 to October 2019) in the PROPDESC prospective observational study, 588 patients aged 70 years and older (mean age 77.2 ± 4.7 years) were included in this subgroup analysis. The primary endpoint POD was considered positive if one of the following tests was positive on any of the five postoperative visit days: Confusion Assessment Method for the ICU (CAM-ICU), Confusion Assessment Method (CAM), 4 'A's Test (4AT), or Delirium Observation Scale (DOS). Trained doctoral students carried out these visits, and the nursing staff were additionally interviewed to complete the DOS. To evaluate the independent effect of POD on LOS in the ICU and in hospital, a multivariable linear regression analysis was performed.
Results
The POD incidence was 25.9%. Our model identified POD as an independent predictor of a prolonged LOS in the ICU (36%; 95% CI 4–78%; p < 0.001) and in hospital (22%; 95% CI 4–43%; p < 0.001).
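Percentage effects like the 36% and 22% above typically arise from a linear model on a log-transformed LOS, where a coefficient beta maps to a 100·(exp(beta) − 1)% change. That the model was specified this way is our assumption for this illustration:

```python
import math

def pct_effect(beta):
    """Assuming LOS entered the linear model on the log scale, a
    coefficient beta corresponds to a 100*(exp(beta)-1) percent change
    in LOS (a common convention, hypothesized here)."""
    return 100.0 * (math.exp(beta) - 1.0)

# a coefficient of log(1.36) on log-LOS corresponds to a 36% prolongation
print(round(pct_effect(math.log(1.36)), 1))  # -> 36.0
```

The same mapping applied to the confidence-interval endpoints of beta yields the percentage confidence intervals reported above.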
Conclusion
POD has an independent impact on LOS in the ICU and in hospital. Given this impact of POD on elderly patients, standardized risk screening is required.
Trial registration
German Registry for Clinical Studies: DRKS00015715.
Medical decision making based on quantitative test results depends on reliable reference intervals, which represent the range of physiological test results in a healthy population. Current methods for the estimation of reference limits focus either on modelling the age-dependent dynamics of different analytes directly in a prospective setting or on extracting independent distributions from contaminated data sources, e.g. data with latent heterogeneity due to unlabeled pathologic cases. In this article, we propose a new method to estimate indirect reference limits with non-linear dependencies on covariates from contaminated datasets by combining the frameworks of mixture models and distributional regression.
Simulation results based on mixtures of Gaussian and gamma distributions suggest accurate approximation of the true quantiles that improves with increasing sample size and decreasing overlap between the mixture components. Due to the high flexibility of the framework, initialization of the algorithm requires careful considerations regarding appropriate starting weights. Estimated quantiles from the extracted distribution of healthy hemoglobin concentration in boys and girls provide clinically useful pediatric reference limits similar to solutions obtained using different approaches which require more samples and are computationally more expensive.
Latent class distributional regression models represent the first method to estimate indirect non-linear reference limits from a single model fit, but the general scope of applications can be extended to other scenarios with latent heterogeneity.
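The core separation step can be illustrated with plain EM for a two-component Gaussian mixture: a stripped-down stand-in for the latent-class distributional regression model (no covariates, no non-linear effects), with hand-picked starting values echoing the point above that initialization needs care. All data and numbers are simulated, not from the study:

```python
import math, random

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_2gauss(x, n_iter=200, init=((13.0, 2.0, 0.7), (10.0, 2.0, 0.3))):
    """Plain EM for a two-component Gaussian mixture; init holds the
    starting (mean, sd, weight) of each component."""
    (m1, s1, w1), (m2, s2, w2) = init
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each observation
        r = []
        for xi in x:
            a, b = w1 * norm_pdf(xi, m1, s1), w2 * norm_pdf(xi, m2, s2)
            r.append(a / (a + b))
        # M-step: responsibility-weighted means, sds and weights
        n1 = sum(r)
        n2 = len(x) - n1
        m1 = sum(ri * xi for ri, xi in zip(r, x)) / n1
        m2 = sum((1 - ri) * xi for ri, xi in zip(r, x)) / n2
        s1 = math.sqrt(sum(ri * (xi - m1) ** 2 for ri, xi in zip(r, x)) / n1)
        s2 = math.sqrt(sum((1 - ri) * (xi - m2) ** 2 for ri, xi in zip(r, x)) / n2)
        w1, w2 = n1 / len(x), n2 / len(x)
    return (m1, s1, w1), (m2, s2, w2)

# simulated hemoglobin-like data: 300 "healthy" near 14, 60 "pathologic" near 9
rng = random.Random(1)
data = [rng.gauss(14.0, 1.0) for _ in range(300)] + \
       [rng.gauss(9.0, 1.0) for _ in range(60)]
(h_mu, h_sd, h_w), (p_mu, p_sd, p_w) = em_2gauss(data)
lower, upper = h_mu - 1.96 * h_sd, h_mu + 1.96 * h_sd  # extracted limits
```

The reference limits are then simply quantiles of the extracted "healthy" component; the full model additionally lets these component distributions depend non-linearly on covariates such as age, which is what makes indirect pediatric reference limits possible from a single fit.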