The early diagnosis of cancer, one of the leading causes of death, is vital for patients. Diagnosing diseases in general, and cancer in particular, is a major application of data analysis in medical science. However, imbalanced data distribution and the unequal quality of the majority and minority classes, which lead to misclassification, pose a great challenge in this field. Although the majority-class samples and their correct classification matter more to a classifier, cancer is diagnosed by relying on the minority-class samples (the cancer data class). While the consequence of a wrong diagnosis for non-cancerous patients is several additional clinical tests, cancerous patients pay for a wrong diagnosis with their lives. As such, studying the class imbalance problem is vital from a medical perspective. To serve this purpose, this paper presents, for the first time, a comprehensive study of the consequences of the imbalanced data problem on cancer patient data. In this context, oversampling and undersampling, the two main balancing approaches, comprising 18 algorithms in total, are employed. The oversampling techniques are ADASYN, ADOMS, AHC, Borderline-SMOTE, ROS, Safe-Level-SMOTE, SMOTE, SMOTE-ENN, SMOTE-TL, SPIDER, and SPIDER2, while the undersampling techniques are CNN, CNNTL, NCL, OSS, RUS, SBC, and TL. To examine the impact of the balancers on classifier performance, four classifiers, RIPPER, MLP, KNN, and C4.5, are employed as learners. In addition, the 15 cancer data sets from the SEER program used for the study are kidney, soft tissue, bladder, rectum, colon, bone, larynx, breast, cervix, prostate, oropharynx, melanoma, thyroid, testis, and lip.
The findings of the study center on examining the impact of class imbalance on classifier performance, a general comparison of the pre-processing techniques and classifiers across all data sets, and finally determining the best balancer and classifier for each kind of cancer data set. According to the results, significant improvement is obtained by using balancers: assessed by AUC, the performance of the classifiers on the imbalanced cancer data sets improved in 90% of cases after applying balancing techniques. To be more precise, Friedman statistical tests are applied and, interestingly, each kind of cancer data set responded differently to the different balancing techniques and classifiers. Moreover, considering the mean rank of each technique and classifier across the data sets, oversampling techniques yield better outcomes than undersampling ones.
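As a concrete illustration of how the oversampling family works, SMOTE (one of the 18 techniques above) synthesizes new minority-class samples by interpolating between a minority sample and one of its k nearest minority-class neighbours. The sketch below is a minimal NumPy rendering of that idea under stated assumptions, not the implementation used in the study; the function name and defaults are illustrative.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE: create n_new synthetic minority samples by
    interpolating between a randomly chosen sample and one of its
    k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    k = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synth = np.empty((n_new, d))
    for i in range(n_new):
        j = rng.integers(n)                    # pick a random minority sample
        nb = X_min[rng.choice(neighbours[j])]  # and one of its k neighbours
        gap = rng.random()                     # interpolation factor in [0, 1]
        synth[i] = X_min[j] + gap * (nb - X_min[j])
    return synth
```

Undersampling techniques such as RUS take the opposite route and discard majority-class samples; the trade-off between the two families is exactly what the paper's Friedman tests compare.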
Coronavirus Disease 2019 is a pandemic that is straining healthcare resources, mainly hospital beds. Multiple risk factors of disease progression requiring hospitalization have been identified, but medical decision-making remains complex.
To characterize a large cohort of patients hospitalized with COVID-19 and their outcomes, and to develop and validate a statistical model that allows individualized prediction of future hospitalization risk for a patient newly diagnosed with COVID-19.
Retrospective cohort study of patients with COVID-19 applying a least absolute shrinkage and selection operator (LASSO) logistic regression algorithm to retain the most predictive features for hospitalization risk, followed by validation in a temporally distinct patient cohort. The final model was displayed as a nomogram and programmed into an online risk calculator.
One healthcare system in Ohio and Florida.
All patients infected with SARS-CoV-2 between March 8, 2020 and June 5, 2020. Those tested before May 1 were included in the development cohort, while those tested on or after May 1 comprised the validation cohort.
Demographic, clinical, social influencers of health, exposure risk, medical co-morbidities, vaccination history, presenting symptoms, medications, and laboratory values were collected on all patients, and considered in our model development.
4,536 patients tested positive for SARS-CoV-2 during the study period. Of those, 958 (21.1%) required hospitalization. By day 3 of hospitalization, 24% of patients had been transferred to the intensive care unit, and around half of the remaining patients had been discharged home. Ten patients died. Hospitalization risk was increased with older age, black race, male sex, former smoking history, diabetes, hypertension, chronic lung disease, poor socioeconomic status, shortness of breath, diarrhea, and certain medications (NSAIDs, immunosuppressive treatment). Hospitalization risk was reduced with prior flu vaccination. Model discrimination was excellent, with an area under the curve of 0.900 (95% CI 0.886-0.914) in the development cohort and 0.813 (0.786-0.839) in the validation cohort. The scaled Brier score was 42.6% (95% CI 37.8%-47.4%) in the development cohort and 25.6% (19.9%-31.3%) in the validation cohort. Calibration was very good. The online risk calculator is freely available at https://riskcalc.org/COVID19Hospitalization/.
Retrospective cohort design.
Our study crystallizes published risk factors of COVID-19 progression, but also provides new data on the role of social influencers of health, race, and influenza vaccination. In a context of a pandemic and limited healthcare resources, individualized outcome prediction through this nomogram or online risk calculator can facilitate complex medical decision-making.
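The two headline performance measures in the abstract above, the area under the ROC curve (discrimination) and the scaled Brier score (overall accuracy relative to a null model that always predicts the event prevalence), are both short computations. The sketch below is a generic illustration of these measures, not the study's own code:

```python
import numpy as np

def auc(y_true, y_score):
    """C-statistic: the probability that a random positive case is
    scored above a random negative case (ties count as 0.5)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

def scaled_brier(y_true, p_hat):
    """Scaled Brier score: 1 - Brier / Brier of the prevalence-only
    null model (1.0 = perfect, 0.0 = no better than prevalence)."""
    y_true, p_hat = np.asarray(y_true, float), np.asarray(p_hat, float)
    brier = np.mean((p_hat - y_true) ** 2)
    null = np.mean((y_true.mean() - y_true) ** 2)
    return 1.0 - brier / null
```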
Cohort studies are observational studies in which a cohort, a group of individuals sharing some characteristic, is followed over time and outcomes are measured at one or more time points. Cohort studies can be classified as prospective or retrospective, and they have several advantages and disadvantages. This article reviews the essential characteristics of cohort studies and includes recommendations on the design, statistical analysis, and reporting of cohort studies in respiratory and critical care medicine. Tools are provided for researchers and reviewers.
Machine learning methods are flexible prediction algorithms that may be more accurate than conventional regression. We compared the accuracy of different techniques for detecting clinical deterioration on the wards in a large, multicenter database.
Observational cohort study.
Five hospitals, from November 2008 until January 2013.
Hospitalized ward patients.
None.
Demographic variables, laboratory values, and vital signs were utilized in a discrete-time survival analysis framework to predict the combined outcome of cardiac arrest, intensive care unit transfer, or death. Two logistic regression models (one using linear predictor terms and a second utilizing restricted cubic splines) were compared to several different machine learning methods. The models were derived in the first 60% of the data by date and then validated in the next 40%. For model derivation, each event time window was matched to a non-event window. All models were compared to each other and to the Modified Early Warning Score (MEWS), a commonly cited early warning score, using the area under the receiver operating characteristic curve (AUC). A total of 269,999 patients were admitted, and 424 cardiac arrests, 13,188 intensive care unit transfers, and 2,840 deaths occurred in the study. In the validation dataset, the random forest model was the most accurate (AUC 0.80; 95% CI 0.80-0.80). The logistic regression model with spline predictors was more accurate than the model utilizing linear predictors (AUC 0.77 vs 0.74; p < 0.01), and all models were more accurate than the MEWS (AUC 0.70; 95% CI 0.70-0.70).
In this multicenter study, we found that several machine learning methods more accurately predicted clinical deterioration than logistic regression. Use of detection algorithms derived from these techniques may result in improved identification of critically ill patients on the wards.
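The discrete-time survival framework mentioned in the abstract above reduces each hospital stay to a sequence of fixed-length person-time windows, with the outcome label set only in the window where the event occurs, so that an ordinary classifier (logistic regression, random forest) can be fit on the rows. A hedged sketch of that expansion step; the window length and field layout are assumptions for illustration, not the study's specification:

```python
def expand_to_windows(stays, window_hours=8):
    """Expand each stay into discrete time windows for a
    discrete-time survival model: one row per (patient, window),
    labelled 1 only in the final window if the event occurred.
    stays: iterable of (patient_id, length_of_stay_hours, event_flag)."""
    rows = []
    for patient_id, los_hours, event in stays:
        n_windows = max(1, -(-los_hours // window_hours))  # ceiling division
        for w in range(n_windows):
            label = 1 if (event and w == n_windows - 1) else 0
            rows.append((patient_id, w, label))
    return rows
```

Predictors measured during each window (vital signs, laboratory values) would then be joined onto these rows before model fitting.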
Developing and evaluating statistical prediction models is challenging, and many pitfalls can arise. This article identifies what the authors believe are some common methodologic concerns that may be encountered. We describe each problem and make suggestions on how to address it. The hope is that this article will lead to higher-quality publications of statistical prediction models.
Coronavirus disease 2019 (COVID-19) is sweeping the globe. Despite multiple case-series, actionable knowledge to tailor decision-making proactively is missing.
Can a statistical model accurately predict infection with COVID-19?
We developed a prospective registry of all patients tested for COVID-19 in Cleveland Clinic to create individualized risk prediction models. We focus here on the likelihood of a positive nasal or oropharyngeal COVID-19 test. A least absolute shrinkage and selection operator logistic regression algorithm was constructed that removed variables that were not contributing to the model's cross-validated concordance index. After external validation in a temporally and geographically distinct cohort, the statistical prediction model was illustrated as a nomogram and deployed in an online risk calculator.
In the development cohort, 11,672 patients fulfilled the study criteria, including 818 (7.0%) who tested positive for COVID-19; in the validation cohort, 2,295 patients fulfilled the criteria, including 290 who tested positive. Male, African American, and older patients, and those with known COVID-19 exposure, were at higher risk of testing positive for COVID-19. Risk was reduced in those who had received the pneumococcal polysaccharide or influenza vaccine, or who were on melatonin, paroxetine, or carvedilol. Our model had favorable discrimination (c-statistic = 0.863 in the development cohort and 0.840 in the validation cohort) and calibration. We present sensitivity, specificity, negative predictive value, and positive predictive value at different prediction cutoff points. The calculator is freely available at https://riskcalc.org/COVID19.
Prediction of a COVID-19 positive test is possible and could help direct health-care resources. We demonstrate relevance of age, race, sex, and socioeconomic characteristics in COVID-19 susceptibility and suggest a potential modifying role of certain common vaccinations and drugs that have been identified in drug-repurposing studies.
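The cutoff-dependent measures reported above (sensitivity, specificity, NPV, PPV) all follow from the 2x2 classification table at a chosen probability threshold. A minimal generic sketch, not the study's code; the function name is illustrative:

```python
def cutoff_measures(y_true, p_hat, cutoff):
    """Sensitivity, specificity, PPV, and NPV from the 2x2 table
    obtained by calling 'positive' whenever p_hat >= cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_hat):
        pred = p >= cutoff
        if pred and y:
            tp += 1          # predicted positive, truly positive
        elif pred and not y:
            fp += 1          # predicted positive, truly negative
        elif not pred and y:
            fn += 1          # predicted negative, truly positive
        else:
            tn += 1          # predicted negative, truly negative
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```

Sweeping the cutoff trades sensitivity against specificity, which is why such tables are reported at several thresholds rather than one.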
Outcome prediction is a major task in clinical medicine. The standard approach is to collect a variety of predictors and build a model of an appropriate type. The model is a mathematical equation that connects the outcome of interest with the predictors, so that the outcome for a new patient with given clinical characteristics can be predicted. However, the equation describing the relationship between predictors and outcome is often complex, and the computation requires software for practical use. An alternative is the nomogram, a graphical calculating device that allows approximate graphical computation of a mathematical function. In this article, we describe how to draw nomograms for various outcomes with the nomogram() function. A binary outcome is fit by a logistic regression model, and the outcome of interest is the probability of the event of interest. Ordinal outcome variables are also discussed. Survival data can be fit with a parametric model to fully describe the distribution of survival time; statistics such as the median survival time and the survival probability up to a specific time point are taken as the outcome of interest.
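To make the nomogram idea concrete for the binary case: each predictor's contribution beta_i * (x_i - ref_i) is rescaled so that the largest possible single contribution spans 0-100 points, and the total points map back to the linear predictor and hence the event probability. The sketch below follows this common scaling convention under stated assumptions; the function names are illustrative and it is not the nomogram() implementation described in the article.

```python
import math

def nomogram_scale(betas, ranges):
    """Build per-predictor point scales for a logistic model.
    betas: coefficients; ranges: (lo, hi) plausible range per predictor.
    Returns (point functions, reference values, points-to-lp factor)."""
    # reference value chosen per predictor so points are never negative
    refs = [lo if b >= 0 else hi for b, (lo, hi) in zip(betas, ranges)]
    spans = [abs(b) * (hi - lo) for b, (lo, hi) in zip(betas, ranges)]
    top = max(spans)  # the widest contribution defines the 0-100 axis
    funcs = [
        (lambda x, b=b, r=r: 100.0 * b * (x - r) / top)
        for b, r in zip(betas, refs)
    ]
    return funcs, refs, top

def probability_from_points(total_points, intercept, betas, refs, top):
    """Map total points back to the logistic event probability."""
    lp = intercept + sum(b * r for b, r in zip(betas, refs))
    lp += total_points * top / 100.0
    return 1.0 / (1.0 + math.exp(-lp))
```

Reading off each predictor's points, summing them, and looking up the probability on the total-points axis is exactly the graphical computation a printed nomogram performs.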
We have seen an increasing number of studies evaluating biomarkers and prognostic factors. Biomedical researchers like to draw conclusions based on P-values. However, P-values are often not needed for this type of study. In this article, we show how most biomedical research problems in this area can be organized into three main analyses, each avoiding the use of P-values.
The three main analyses follow the framework of prediction modeling when the outcome of interest is binary or time-to-event. They make use of figures such as boxplots, nonparametric smoothing lines, and nomograms, and incorporate prediction performance measures such as the area under the receiver operating characteristic curve and the index of predictive accuracy.
Our proposed framework is easy to follow. It is also in line with most of the research in the field of biomarkers and prognostic factors evaluation, such as reclassification table, net reclassification index, Akaike information criterion/Bayesian information criterion, receiver operating characteristic curve, and decision curve analysis.
Overall, we provide a step-by-step guideline that biomedical researchers could easily follow to conduct statistical analysis without using P-values, especially with the goal of evaluating biomarkers and prognostic factors.
•Through combining multi-objective particle swarm optimization and random forest, a new approach is proposed to predict heart disease.
•The main goal is to produce diverse and accurate classifiers and determine the (near) optimal number of classifiers.
•The results indicate that the proposed algorithm outperforms the other techniques in terms of accuracy and statistical tests.
Heart disease has been one of the leading causes of death worldwide in recent years. Among diagnostic methods for heart disease, angiography is one of the most common, but it is costly and has side effects. Given the difficulty of heart disease prediction, data mining can play an important role in predicting heart disease accurately. In this paper, by combining multi-objective particle swarm optimization (MOPSO) and random forest, a new approach is proposed to predict heart disease. The main goal is to produce diverse and accurate decision trees while simultaneously determining the (near) optimal number of them. In this method, an evolutionary multi-objective approach is used in place of the commonly used mechanisms, i.e., bootstrap sampling, feature selection within the random forest, and random selection of the number of training sets. In this way, different training sets, with different samples and features for training each tree, are generated, and the solutions on the Pareto-optimal fronts determine the number of training sets required to build the random forest. As a result, the random forest's performance can be enhanced and, consequently, prediction accuracy improved. The proposed method's effectiveness is investigated by comparing its performance with individual and ensemble classifiers over six heart disease data sets. The results suggest that the proposed method with the (near) optimal number of classifiers outperforms the random forest algorithm with different classifiers.
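The core mechanism, training each tree on a different subset of samples and features so the ensemble is both diverse and accurate, can be sketched without the MOPSO search itself. The toy below plainly substitutes random subset generation for the Pareto optimization and single-feature threshold stumps for full decision trees; it illustrates the diversity-through-different-training-sets idea only, and is not the authors' algorithm.

```python
import numpy as np

def train_stump(X, y):
    """Best single-feature threshold rule (a tiny stand-in for a tree).
    Returns (feature index, threshold, sign)."""
    best = (0, 0.0, 1, 0.0)  # feature, threshold, sign, accuracy
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = (sign * (X[:, f] - t) >= 0).astype(int)
                acc = (pred == y).mean()
                if acc > best[3]:
                    best = (f, t, sign, acc)
    return best[:3]

def diverse_forest(X, y, n_trees=7, rng=0):
    """Train each stump on a random subset of samples and features
    (a random stand-in for the MOPSO-generated training sets)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    stumps = []
    for _ in range(n_trees):
        rows = rng.choice(n, size=max(2, (2 * n) // 3), replace=False)
        feats = rng.choice(d, size=max(1, d // 2 + 1), replace=False)
        f, t, s = train_stump(X[np.ix_(rows, feats)], y[rows])
        stumps.append((feats[f], t, s))  # map back to global feature index
    return stumps

def ensemble_predict(stumps, X):
    """Majority vote over the diverse stumps."""
    votes = np.array([(s * (X[:, f] - t) >= 0).astype(int)
                      for f, t, s in stumps])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

In the paper's method, the sample/feature subsets come from the Pareto-optimal fronts of the MOPSO search rather than from a random generator, which is what drives both the diversity and the choice of ensemble size.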