Super Learner van der Laan, Mark J.; Polley, Eric C; Hubbard, Alan E.
Statistical Applications in Genetics and Molecular Biology,
9/2007, Volume 6, Issue 1
Journal Article
Peer reviewed
When trying to learn a model for the prediction of an outcome given a set of covariates, a statistician has many estimation procedures in their toolbox. A few examples of these candidate learners are: least squares, least angle regression, random forests, and spline regression. Previous articles (van der Laan and Dudoit (2003); van der Laan et al. (2006); Sinisi et al. (2007)) theoretically validated the use of cross validation to select an optimal learner among many candidate learners. Motivated by this use of cross validation, we propose a new prediction method for creating a weighted combination of many candidate learners to build the super learner. This article proposes a fast algorithm for constructing a super learner in prediction which uses V-fold cross-validation to select weights to combine an initial set of candidate learners. In addition, this paper contains a practical demonstration of the adaptivity of this so-called super learner to various true data generating distributions. This approach for construction of a super learner generalizes to any parameter which can be defined as a minimizer of a loss function.
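The weighting step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy candidate learners (`fit_mean`, `fit_line`), the restriction to exactly two learners, the squared-error loss, and the grid search over a convex weight are all assumptions made for illustration, since the abstract does not specify how the weights are fitted.

```python
def fit_mean(xs, ys):
    """Candidate learner 1 (hypothetical): always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate learner 2 (hypothetical): simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def cv_predictions(xs, ys, learners, v=5):
    """For each learner, predict every observation from a model
    trained on the other V-1 folds (out-of-fold predictions)."""
    n = len(xs)
    preds = [[0.0] * n for _ in learners]
    for fold in range(v):
        test = [i for i in range(n) if i % v == fold]
        train = [i for i in range(n) if i % v != fold]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for k, fit in enumerate(learners):
            model = fit(tx, ty)
            for i in test:
                preds[k][i] = model(xs[i])
    return preds


def super_learner(xs, ys, learners, v=5, grid=101):
    """Combine exactly two learners with the convex weight that
    minimizes cross-validated squared error (assumed loss)."""
    if len(learners) != 2:
        raise ValueError("this sketch handles exactly two learners")
    z0, z1 = cv_predictions(xs, ys, learners, v)
    best_risk, best_a = None, None
    for g in range(grid):
        a = g / (grid - 1)  # weight on the first learner
        risk = sum((y - (a * p0 + (1 - a) * p1)) ** 2
                   for y, p0, p1 in zip(ys, z0, z1)) / len(ys)
        if best_risk is None or risk < best_risk:
            best_risk, best_a = risk, a
    # Refit both learners on all data, then apply the selected weights.
    m0, m1 = (fit(xs, ys) for fit in learners)

    def predict(x):
        return best_a * m0(x) + (1 - best_a) * m1(x)

    return predict, (best_a, 1 - best_a), best_risk
```

Calling `super_learner(xs, ys, [fit_mean, fit_line])` returns a prediction function, the selected weights, and the cross-validated risk. On data that are close to linear, most of the weight would be expected to go to `fit_line`.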
As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discover higher order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies, which are limited.
Using a multiple sclerosis (MS) case-control dataset comprising 300K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, RF analysis identifies four interesting new candidate MS genes (MPHOSPH9, CTNNA3, PHACTR2 and IL7) that warrant further follow-up in independent studies.
This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and the results obtained make biologic sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
The frailty index (FI) is one way in which frailty can be quantified. While it is measured as a continuous variable, various cut-off points have been used to categorise older adults as frail or non-frail, and these have largely been validated in the acute care or community settings for older adults without cancer. This review aimed to explore which FI categories have been applied to older adults with cancer and to determine why these categories were selected by study authors.
This scoping review searched Medline, EMBASE, Cochrane, CINAHL, and Web of Science databases for studies which measured and categorised an FI in adults with cancer. Of the 1994 screened, 41 were eligible for inclusion. Data including oncological setting, FI categories, and the references or rationale for categorisation were extracted and analysed.
The FI score used to categorise participants as frail ranged from 0.06 to 0.35, with 0.35 being the most frequently used, followed by 0.25 and 0.20. The rationale for FI categories was provided in most studies but was not always relevant. Three of the included studies using an FI > 0.35 to define frailty were frequently referenced as the rationale for subsequent studies; however, the original rationale for this categorisation was unclear. Few studies sought to determine or validate optimum FI categories in this population.
There is significant variability in how studies have categorised the FI in older adults with cancer. An FI ≥ 0.35 to categorise frailty was used most frequently; however, an FI in this range has often represented at least moderate to severe frailty in other highly-cited studies. These findings contrast with a scoping review of highly-cited studies categorising FI in older adults without cancer, where an FI ≥ 0.25 was most common. Maintaining the FI as a continuous variable is likely to be beneficial until further validation studies determine optimum FI categories in this population. Differences in how the FI has been categorised, and indeed how older adults have been labelled as 'frail', limit our ability to synthesise results and to understand the impact of frailty in cancer care.
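To make the effect of the choice of cut-off concrete, the following minimal sketch labels the same FI score under the three most frequently reported cut-offs (0.20, 0.25 and 0.35). The function name `label_frailty`, the use of "greater than or equal to" as the comparison, and the example score of 0.30 are illustrative assumptions, not taken from the review.

```python
# Cut-offs reported most frequently in the review.
CUTOFFS = (0.20, 0.25, 0.35)


def label_frailty(fi, cutoff):
    """Dichotomise a continuous frailty index at a given cut-off.

    Assumes 'frail' means fi >= cutoff; individual studies may
    instead use a strict inequality.
    """
    if not 0.0 <= fi <= 1.0:
        raise ValueError("frailty index must lie between 0 and 1")
    return "frail" if fi >= cutoff else "non-frail"


def labels_across_cutoffs(fi, cutoffs=CUTOFFS):
    """Return the label the same person would receive under each cut-off."""
    return {c: label_frailty(fi, c) for c in cutoffs}
```

For a hypothetical participant with an FI of 0.30, `labels_across_cutoffs(0.30)` gives "frail" at 0.20 and 0.25 but "non-frail" at 0.35, which is the kind of inconsistency that makes results hard to synthesise across studies.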
Cell proliferation must be coordinated with cell fate specification during development, yet interactions among pathways that control these two critical aspects of development are not well understood. The coordination of cell fate specification and proliferation is particularly crucial during early germline development, when it impacts the establishment of stem/progenitor cell populations and ultimately the production of gametes. In C. elegans, insulin/IGF-like receptor (IIR) signaling has been implicated in fertility, but the basis for the fertility defect had not been previously characterized. We found that IIR signaling is required for robust larval germline proliferation, separate from its well-characterized role in preventing dauer entry. IIR signaling stimulates the larval germline cell cycle. This activity is distinct from Notch signaling, occurs in a predominantly germline-autonomous manner, and responds to somatic activity of ins-3 and ins-33, genes that encode putative insulin-like ligands. IIR signaling in this role acts through the canonical PI3K pathway, inhibiting DAF-16/FOXO. However, signaling from these ligands does not inhibit daf-16 in neurons nor in the intestine, two tissues previously implicated in other IIR roles. Our data are consistent with a model in which: (1) under replete reproductive conditions, the larval germline responds to insulin signaling to ensure robust germline proliferation that builds up the germline stem cell population; and (2) distinct insulin-like ligands contribute to different phenotypes by acting on IIR signaling in different tissues.
The effects of weather on diarrhea could influence the health impacts of climate change. Children have the highest diarrhea incidence, especially in India, where many lack safe water and sanitation.
In a prospective cohort of 1,284 children under 5 y of age from 900 households across 25 villages in rural Tamil Nadu, India, we examined whether high temperature and heavy rainfall were associated with increased all-cause diarrhea and water contamination.
Seven-day prevalence of diarrhea was assessed monthly for up to 12 visits from January 2008 to April 2009, and hydrogen sulfide (H₂S) presence in drinking water, a fecal contamination indicator, was tested in a subset of households. We estimated associations between temperature and rainfall exposures and diarrhea and H₂S using binomial regressions, adjusting for potential confounders, random effects for village, and autoregressive-1 error terms for study week.
There were 259 cases of diarrhea. The prevalence of diarrhea during the 7 d before visits was 2.95 times higher (95% CI: 1.99, 4.39) when mean temperature in the week before the 7-d recall was in the hottest versus the coolest quartile of weekly mean temperature during 1 December 2007 to 15 April 2009. Diarrhea prevalence was 1.50 times higher when the 3 weeks before the diarrhea recall period included Formula: see text (vs. 0 d) with rainfall of Formula: see text (95% CI: 1.12, 2.02), and 2.60 times higher (95% CI: 1.55, 4.36) for heavy rain weeks following a 60-d dry period. The H₂S prevalence in household water was not associated with heavy rain prior to sample collection.
The results suggest that, in rural Tamil Nadu, heavy rainfall may wash pathogens that accumulate during dry weather into child contact. Higher temperatures were positively associated with diarrhea 1-3 weeks later. Our findings suggest that diarrhea morbidity could worsen under climate change without interventions to reduce enteric pathogen transmission through multiple pathways. https://doi.org/10.1289/EHP3711.
In this work we introduce the personalized online super learner (POSL), an online personalizable ensemble machine learning algorithm for streaming data. POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized, that is, optimization with respect to subject ID, to many individuals, that is, optimization with respect to common baseline covariates. As an online algorithm, POSL learns in real time. As a super learner, POSL is grounded in statistical optimality theory and can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed/offline algorithms that are not updated during POSL's fitting procedure, pooled algorithms that learn from many individuals' time series, and individualized algorithms that learn from within a single time series. POSL's ensembling of the candidates can depend on the amount of data collected, the stationarity of the time series, and the mutual characteristics of a group of time series. Depending on the underlying data‐generating process and the information available in the data, POSL is able to adapt to learning across samples, through time, or both. For a range of simulations that reflect realistic forecasting scenarios and in a medical application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for both short and long time series, and that it is able to adjust to changing data‐generating environments. We further cultivate POSL's practicality by extending it to settings where time series dynamically enter and exit.
The contribution of cerebral small vessel disease (cSVD) to the pathogenesis of frailty remains uncertain. We aimed to examine the association of cSVD with progression of frailty in a population-based study of older people.
People aged between 60 and 85 years were randomly selected from the electoral roll to participate in the Tasmanian Study of Cognition and Gait. Participants underwent self-reported questionnaires, objective gait, cognitive and sensorimotor testing over three phases ranging between 2005 and 2012. These data were used to calculate a 41-item frailty index (FI) at three time points. Baseline brain magnetic resonance imaging was performed on all participants to measure cSVD. Generalized mixed models were used to examine associations between baseline cSVD and progression of frailty, adjusted for confounders of age, sex, level of education, and total intracranial volume.
At baseline (n = 388) mean age was 72 years (SD = 7.0), 44% were female, and the median FI score was 0.20 (interquartile range [IQR]: 0.12, 0.27). In fully adjusted models higher burden of baseline white matter hyperintensity (WMH) was associated with frailty progression over 4.4 years (β = 0.03, 95% CI: 0.01, 0.05; p = .004) independent of other SVD markers. Neither baseline infarcts (p = .23) nor microbleeds at baseline (p = .65) were associated with progression of frailty.
We provide evidence for an association between baseline WMHs and progression of frailty. Our findings add to a growing body of literature suggesting WMH is a marker for frailty.
Background: multidisciplinary rehabilitation is of proven benefit in the management of older inpatients. However, the identification of patients who will do well with rehabilitation currently lacks a strong evidence base.
Objectives: the aim of this study was to compare the importance of chronological age, gender, co-morbidities and frailty in the prediction of adverse outcomes for patients admitted to an acute geriatric rehabilitation ward.
Design: prospective observational cohort study.
Subjects and setting: two hundred and sixty-five patients admitted consecutively to an acute geriatric rehabilitation ward at a tertiary care teaching hospital.
Methods: frailty status was measured by an index of accumulated deficits, giving a potential score from 0 (no deficits) to 1.0 (all 40 deficits present). Patients were stratified into three outcomes: good (discharged to original residence within 28 days), intermediate (discharged to original residence but longer hospital stay) and poor (newly institutionalised or died).
Results: patients were old (82.6 ± 8.6 years) and frail (mean frailty index (FI) 0.34 ± 0.09). Frailty status correlated significantly with length of stay and was a predictor of poor functional gain. The odds ratio of intermediate and poor outcome relative to a good outcome was 4.95 (95% CI = 3.21, 7.59; P < 0.001) per unit increase in FI. Chronological age, gender and co-morbidity showed no significant association with outcomes.
Conclusion: frailty is associated with adverse rehabilitation outcomes. The FI may have clinical utility, augmenting clinical judgement in the management of older inpatients.
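The index of accumulated deficits described in the Methods can be sketched as the proportion of assessed deficits that are present. This is a minimal illustration only: the function name `frailty_index`, the 0/1 coding of each deficit, and the requirement that all 40 items be scored are assumptions, since the abstract does not describe item coding or the handling of missing items.

```python
N_DEFICITS = 40  # number of deficits in the index, as stated in the abstract


def frailty_index(deficits, n_items=N_DEFICITS):
    """Return the proportion of deficits present.

    deficits: sequence of 0 (absent) or 1 (present), one per item.
    Assumes every item was assessed; partial scoring is not handled.
    """
    if len(deficits) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(deficits)}")
    if any(d not in (0, 1) for d in deficits):
        raise ValueError("each deficit must be coded 0 or 1")
    return sum(deficits) / n_items
```

Under these assumptions a patient with no deficits scores 0, a patient with all 40 scores 1.0, and a hypothetical patient with 14 of 40 deficits present scores 0.35, close to the cohort mean of 0.34 reported in the Results.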
Computational methods and tools are a powerful complementary approach to experimental work for studying regulatory interactions in living cells and systems. We demonstrate the use of formal reasoning methods as applied to the Caenorhabditis elegans germ line, which is an accessible system for stem cell research. The dynamics of the underlying genetic networks and their potential regulatory interactions are key for understanding mechanisms that control cellular decision-making between stem cells and differentiation. We model the “stem cell fate” versus entry into the “meiotic development” pathway decision circuit in the young adult germ line based on an extensive study of published experimental data and known/hypothesized genetic interactions. We apply a formal reasoning framework to derive predictive networks for control of differentiation. Using this approach we simultaneously specify many possible scenarios and experiments together with potential genetic interactions, and synthesize genetic networks consistent with all encoded experimental observations. In silico analysis of knock-down and overexpression experiments within our model recapitulates published phenotypes of mutant animals and can be applied to make predictions on cellular decision-making. A methodological contribution of this work is demonstrating how to effectively model within a formal reasoning framework a complex genetic network with a wealth of known experimental data and constraints. We provide a summary of the steps we have found useful for the development and analysis of this model, which may be applicable to other genetic networks. This work also lays a foundation for developing realistic whole tissue models of the C. elegans germ line where each cell in the model will execute a synthesized genetic network.